PCT 



WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau 



INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification 7 : 
C12Q 1/68 



A2 



(11) International Publication Number: WO 00/65088 

(43) International Publication Date: 2 November 2000 (02.11.00)



(21) International Application Number: PCT/EP00/03636 

(22) International Filing Date: 20 April 2000 (20.04.00) 



(30) Priority Data:
99303215.0    26 April 1999 (26.04.99)    EP



(71) Applicant (for all designated States except US): AMERSHAM 

PHARMACIA BIOTECH AB [SE/SE]; Bjorkgatan 30, 
S-751 84 Uppsala (SE). 

(72) Inventors; and 

(75) Inventors/Applicants (for US only): ULFENDAHL, Per-Johan 
[SE/SE]; Rapphonsvagen 10B, S-756 53 Uppsala (SE). 
WONG, Kin-Chun [SE/SE]; Ursviksvagen 2B, S-172 36
Sundbyberg (SE). 

(74) Agent: ROLLINS, Anthony, John; Nycomed Amersham plc,
Amersham Laboratories, White Lion Road, Amersham, 
Bucks HP7 9LL (GB). 



(81) Designated States: AE, AL, AM, AT, AU, AZ, BA, BB, BG,
BR, BY, CA, CH, CN, CR, CU, CZ, DE, DK, DM, EE,
ES, FI, GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP,
KE, KG, KP, KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA,
MD, MG, MK, MN, MW, MX, NO, NZ, PL, PT, RO, RU,
SD, SE, SG, SI, SK, SL, TJ, TM, TR, TT, TZ, UA, UG,
US, UZ, VN, YU, ZA, ZW, ARIPO patent (GH, GM, KE,
LS, MW, SD, SL, SZ, TZ, UG, ZW), Eurasian patent (AM,
AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT,
BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU,
MC, NL, PT, SE), OAPI patent (BF, BJ, CF, CG, CI, CM,
GA, GN, GW, ML, MR, NE, SN, TD, TG).



Published 

Without international search report and to be republished
upon receipt of that report.



(54) Title: PRIMERS FOR IDENTIFYING TYPING OR CLASSIFYING NUCLEIC ACIDS 



(57) Abstract 



A method is described for identifying a rather small set of extendible primers for use in the identification, typing or classification
of a nucleic acid of known sequence having known polymorphisms. A matrix of primers and pairs of primer extensions is prepared and
subjected to analysis by a set covering problem algorithm, e.g. a greedy algorithm or one which involves a Lagrangian relaxation heuristic.
Sets of primers are described for use in the identification, classification or typing of an organism, allele or gene selected from class 1 HLA,
class 2 HLA and 16S rRNA.




WO 00/65088 PCT/EP00/03636



PRIMERS FOR IDENTIFYING TYPING OR 
CLASSIFYING NUCLEIC ACIDS 

DNA sequence analysis is rapidly becoming a standard tool in modern
molecular biology research. Examples of applications include sequencing
of unknown DNA sequences, identifying novel genes in stretches of
sequenced DNA, predicting protein sequence and structure from DNA
sequence alone, and identifying known gene variations (sometimes called
"typing a gene").

Typing of a gene can be crucial in some applications. For instance,
organ donation requires that the "immunological signature" of the donor
matches that of the recipient. This "signature" is mediated by the Human
Leucocyte Antigen (HLA) complexes (also known as the Major
Histocompatibility Complex, MHC) on the cell surface, and the
corresponding genes are among the most varied in the human genome.
Considering the importance of organ donation, the shortage of organ
donors and the fact that an organ cannot be stored for long periods,
rapid and accurate typing of the HLA genes is required in order to make
the best use of the organs available for transplantation.

Another application where rapid and accurate identification of a gene
is desired is the identification of unknown bacteria. Rapid
identification of the bacteria causing the illness of a patient makes it
possible to administer the correct medication early in the treatment of
the disease, thus reducing the discomfort for the patient. Since every
self-replicating organism so far studied uses ribosomes when translating
mRNA into proteins, analysis of one of the genes coding for the ribosome,
for instance the 16S rRNA in the case of prokaryotes, could be used to
identify the organism in question.

There are several ways in which a gene can be identified, the
conceptually easiest being to sequence the entire gene and then look at
the result. The main drawback is that this approach is time-consuming and
not easily scaled up using conventional methodology. A new method,
Arrayed Primer Extension (APEX), lacks this drawback. APEX works by
immobilising a large number of primers on a solid surface, thus creating
a DNA chip. These primers are constructed to overlap consecutively over
the entire gene of interest, so that every base in the gene has a primer
immediately to its 5' side. When fluorescently labelled
dideoxynucleotides are added, the primers are extended by one nucleotide
using the sample DNA as template. It is then easy to check which
nucleotide was incorporated at each primer, which in turn gives the
entire sequence of the sample DNA.

Since some genes, such as the HLA and 16S rRNA genes, have a large
number of known variations, a prohibitively large number of primers has
to be created in order to probe for all possible combinations of variant
positions in the gene. For example, resequencing by the arrayed primer
extension method APEX would need more than 16,000 primers if all DQB
alleles were to be sequenced from a 500 bp long PCR fragment. If all
pairwise combinations of DQB alleles had to be covered, which is the
situation for a heterozygote as found in most individuals, the number of
primers might be even higher.

But this might not be necessary, if some variations always or 
never occur together. This needs to be studied though, and a way found to 
determine the least number of primers (and what their sequences are) 
required for unambiguously identifying those genes. 

An object of this invention is to find and implement an efficient
algorithm capable of doing just that. The algorithm should preferably
also take into account the melting points of the primers, so that the
extension reaction can take place under optimal conditions for all of the
primers on the chip. It should also minimise the number of
"self-extended" primers, i.e. primers that can extend themselves without
any sample DNA. This algorithm is then to be tested and evaluated on the
HLA and 16S rRNA genes. HLA is chosen partly because of the importance of
rapid typing of these genes, as a result of which there are many other
methods with which APEX can be compared. It is also chosen because the
HLA genes are "easy" to work with, since they rarely contain any
insertions or deletions. These kinds of variations in the gene could
potentially create problems when designing primers for APEX. The 16S rRNA
gene, on the other hand, contains insertions and deletions and can thus
be used to see whether the algorithm can handle such variations.

The invention provides a method of identifying a set of 
extendible primers for use in the identification, typing or classification of a 
nucleic acid of known sequence having known polymorphisms wherein: 

i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified and their corresponding extendible primers,

ii) at least one extendible primer is removed from the set
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

Preferably the method includes between step i) and ii):

ia) potential extensions for each primer are identified with
respect to each nucleotide sequence,

ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.

Preferably a matrix of primers and pairs of primer extensions 
is prepared in binary form and is subjected to analysis by a set covering 
problem (SCP) algorithm as described in more detail below. 

The invention also includes a set of extendible primers, for
use in the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms, identified by the method as
defined. Preferably the primers are attached by their 5'-ends to a surface
of a support on which they are presented in the form of an array.

In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of a human
leucocyte antigen (HLA) gene as indicated, the set comprising about the
number of primers indicated and being capable of distinguishing about the
number of alleles indicated:





           HLA gene   Number of Alleles   Number of Primers
Class I    HLA-A       91                  172
           HLA-B      200                 <1000
           HLA-C       47                   94
Class II   DPA-1       11                   26
           DPB-1       74                  130
           DQA-1       17                  130
           DQB-1       34                   84
           DRB-1      192                 <1000
           DRB345      35                   94



In another aspect, the invention provides a set of extendible
primers, for use in the identification, typing or classification of 16S rRNA,
wherein the set comprises about 210 primers and is capable of
distinguishing at least about 1207 different sequences.

In these aspects of the invention, the approximate number of
primers is indicated. As indicated below, it may be possible by the use of
the algorithms exemplified or other algorithms to generate slightly smaller
sets of primers capable of distinguishing the number of alleles or
sequences indicated, and these sets are envisaged according to the
invention. Of course, other primers may be present in addition to those
indicated as essential, and may be useful for checking purposes. The
number of alleles or sequences indicated represents the approximate
known number of polymorphisms or different sequences, and these will
surely increase with time.

In another aspect the invention provides a method of
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, by the use of the set of extendible primers
as defined, which method comprises applying the nucleic acid or fragments
thereof to the set of extendible primers under hybridisation conditions and
effecting template-directed chain extension of extendible primers that have
formed hybrids. Preferably template-directed chain extension is effected
using four different fluorescently labelled chain-terminating nucleotide
analogues, and results are analysed by an imaging system such as total
internal reflection fluorescence (TIRF) or scanning confocal microscopy.
The various steps of the method may be performed as described in the
literature for the known APEX technique.

In another aspect the invention provides a kit for use in the
identification, typing or characterisation of a nucleic acid of known
sequence having known polymorphisms, comprising the set of extendible
primers as defined.

In another aspect the invention provides an array of sets of 
extendible primers as defined, for the simultaneous identification, typing or 
classification of two or more different HLA genes. 

With the present invention it has been realised that where a
number of different alleles are to be identified, the total number of primers
required to distinguish each of the alleles could be reduced as some
primers would, for example, be common to all of the alleles. Thus, with the
present invention complete sets of primers for identification of each allele
are identified and then the total number of primers in the combined sets is
reduced using predetermined rules.

Furthermore the present invention is based on the premise
that as the primers are used to identify the presence or absence of a
particular nucleotide sequence in any allele, the specific nucleotide that
extends any particular primer is of less relevance than simply whether the
primer has been extended. Thus, the problem of reducing the overall
number of primers is greatly simplified, rendering the problem one suitable
for treatment as a Set Covering Problem (SCP).

Embodiments of the present invention will now be described
by way of example with reference to the accompanying drawings and
examples, in which:

Figure 1 is a diagram of a signal matrix in accordance with
the present invention;

Figure 2 is a diagram of the corresponding binary matrix for
the signal matrix of Figure 1;

Figure 3 is a flow diagram of the steps for reducing the primer
set in accordance with the present invention.

The following is an explanation to assist in an understanding
of the principles underlying the manner in which the number of primers
used in the identification of a plurality of sequences may be reduced.

Theoretically the number of primers required to identify k
sequences grows as O(kl), where l is the length of the sequences, as each
sequence requires l primers. However, the less the sequences differ from
one another, the fewer primers are required, as many of the primers
required for identification of a first sequence may also be of use in
identification of another sequence. This effect becomes more pronounced
the greater the number of sequences to be identified and the greater the
similarities.

Considering an initial set of n primers required in the
identification of k sequences, a signal matrix of k x n can be constructed.
Each element in the matrix represents the signal, if any, that is generated
by a particular primer with respect to a particular sequence. The signal will
either be one of the four nucleotides 'A', 'C', 'G' or 'T', or no signal.
Figure 1 is an example of such a signal matrix where, for example, the
signal generated by primer 2 with respect to sequence 3 is 'T'.

The signal matrix is then converted into a binary matrix that
represents whether the signals for any particular primer differ with respect
to different sequences. Thus, again with respect to primer 2, the same
signal 'G' is generated for both sequences 1 and 2 but a different signal 'T'
is generated with respect to sequence 3. The binary matrix is constructed
by considering each column (each primer) of the signal matrix and
comparing each signal in that column in turn. Thus, as shown in Figure 2,
the first row of the matrix represents a comparison of the signals for the first
and second sequences, the second row represents a comparison of the
signals for the first and third sequences and the third row represents a
comparison of the signals for the second and third sequences. Binary '0'
indicates that the comparison reveals the same signal and binary '1'
indicates that the comparison reveals different signals. In the case of
primer 2, as mentioned earlier, the signals for the first and second
sequences are the same ('0') whereas the signals for the first and third
sequences are different ('1'). This conversion produces an m x n matrix
where m = k(k-1)/2. Hence, for large numbers of sequences, 2m grows
approximately as the square of the number of sequences. Figure 2 shows
the binary matrix for the signal matrix of Figure 1.

As the primers are required to enable the differentiation of
sequences from one another, the reduction of the signal matrix to a binary
matrix, representing differences in the signals obtained for different
sequences, distils that element of information necessary to enable a
selection of the minimum number of primers necessary to identify the
individual sequences. From the binary matrix the least number of columns
is selected such that each row contains at least one non-zero element in
the selected columns. Thus, if one of the columns contained all '1's, only
that one column would be required. However, in the case of Figure 2, there
is no single column containing all '1's and so two columns must be
selected, for example primers 1 and 2. Primers 1 and 2 together enable
each of sequences 1, 2 and 3 to be differentiated and so the remaining
primers are redundant.
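
The conversion and selection just described can be illustrated with a minimal Python sketch. The 3 x 4 signal matrix below is hypothetical: only the primer 2 column ('G', 'G', 'T') is taken from the description, the remaining entries are assumed for illustration and are not those of Figure 1. The exhaustive search in `smallest_cover` is an assumption made for clarity and is only practical for very small matrices.

```python
from itertools import combinations

# Hypothetical signal matrix: rows = sequences 1..3, columns = primers 1..4.
# '-' means that the primer gives no signal for that sequence.
signal = [
    ["A", "G", "C", "-"],
    ["C", "G", "C", "-"],
    ["C", "T", "C", "A"],
]

def binary_matrix(signal):
    """One row per pair of sequences; 1 where a primer's two signals differ."""
    pairs = list(combinations(range(len(signal)), 2))
    n = len(signal[0])
    rows = [[int(signal[a][j] != signal[b][j]) for j in range(n)]
            for a, b in pairs]
    return pairs, rows

def smallest_cover(binary):
    """Try column subsets of increasing size until every row has a 1."""
    n = len(binary[0])
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            if all(any(row[j] for j in cols) for row in binary):
                return cols
    return None

pairs, binary = binary_matrix(signal)
cover = smallest_cover(binary)   # column indices are zero-based
```

With these assumed signals the binary matrix has m = k(k-1)/2 = 3 rows, and the first two columns (primers 1 and 2) form a smallest cover, in line with the example given in the text.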

Where large numbers of sequences and primers are involved,
the binary matrix renders the data contained within that matrix suitable for
mathematical analysis. Once the selection of the reduced number of
primers has been made, though, it is the signal matrix that is required
during the use of the primers in the identification of the different sequences.
Thus, the signal matrix is used to 'decode' the results of any analysis using
the reduced number of primers.

In practice, large numbers of sequences and primers are
involved and the selection of a reduced set of primers cannot be performed
by simple inspection of the binary matrix. For large numbers of primers,
selection of a suitable reduced set of primers can be performed by treating
the selection as a Set Covering Problem (SCP). An SCP is an integer
optimisation problem and is well known in fields such as airline crew
scheduling, selecting manufacturing equipment and ingot mould selection
in steel production. For such large-scale problems, which cannot in
practice be solved exactly (the SCP is NP-hard), heuristics are used in
order to generate a solution. As an SCP is NP-hard, global algorithms and
algorithms that identify local optima are not very suitable on their own for
a large-scale SCP. They simply require far too much computation, as they
try to find a solution that can be proven to be at least locally optimal. For
this reason heuristic methods are required instead. They do not claim to
give even a locally optimal solution, but are much faster.

Two known computational methods that have been found to 
be effective in identifying reduced sets of primers are the 'greedy' algorithm 
and Lagrangian relaxation algorithm. 


Greedy Algorithm 

The simplest heuristic is the greedy algorithm, where
columns are added one at a time. The column to be added in each step is
chosen so as to cover as many uncovered rows as possible (a row is
covered if it has at least one non-zero element in a selected column). In
other words, if S^r is the set of columns already included in the solution at
iteration r, and R^r is the set of rows with no non-zero elements in those
columns at iteration r, column j* is selected according to:

\[ j^{*} = \arg\min_{j \notin S^{r}} \frac{c_j}{P_j} \]
Equation 1

where c_j is the cost of column j and P_j is the number of rows of R^r
covered by column j.

This continues until all rows are covered, or until no more
columns exist which can cover any of the rows still uncovered. Instead of
minimising the term c_j / P_j, other terms can be used. Example terms are
c_j, c_j / log_2 P_j or c_j / (P_j)^2. Greedy algorithms of this type are
described in "An Efficient Heuristic for Large Set Covering Problems",
Vasko, Wilson, Naval Research Logistics Quarterly 1984, 31:163-171, the
contents of which are incorporated herein by reference. The difference
between the terms lies in how much emphasis is placed on the cost of the
column versus how many rows the column covers. It has been shown,
however, that this entire class of heuristics shares the same worst-case
behaviour. If we denote the set of columns in the solution as S and the
solution value as Z, then the worst-case behaviour can be described as:

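\[ \frac{Z}{Z_{opt}} \le \sum_{i=1}^{d} \frac{1}{i} \]
Equation 2

where d is the maximum number of non-zero elements in any column,
Z_opt is the value of the optimal solution and

\[ Z = \sum_{j \in S} c_j x_j \]
Equation 3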

In other words, how much worse the heuristic solution is
compared to the optimal solution depends on the maximum number of
non-zero elements in the columns. The advantage is that this algorithm is
fast, even though its time complexity is O(m^2 n): there can be a maximum
of m columns in the solution, i.e. the maximum number of iterations is m,
and for each iteration the m x n matrix is traversed once to find the next
column to be added. Altogether, the time required to solve the problem in
the worst case scenario will grow as the number of sequences to the power
of five (four due to the number of rows, and one due to the number of
columns). In the case of 16S rRNA (see later), where there are ~1000
sequences, the matrix will have ~500,000 rows. The number of primers
(columns) is in this case ~250,000.

Lagrangian relaxation

More sophisticated methods exist, which use other kinds of
heuristics. The heuristics believed to be capable of generating the best
solutions are Lagrangian relaxation heuristics, where in each iteration the
Lagrange multipliers are used to calculate the Lagrangian cost for the
columns. Such a Lagrangian relaxation heuristic is described in "A
Heuristic Method for the Set Covering Problem", Caprara et al., Technical
Report OR-95-8, Operations Research Group, University of Bologna 1995,
the content of which is incorporated herein by reference. A near-optimal
vector of these costs is then calculated by a subgradient algorithm, before
being used as input to a greedy algorithm. This is repeated until no
improvements in the solution can be made.

In Lagrangian subgradient methods the Lagrangian of the
original problem is considered instead of the original problem. In this case,
the Lagrangian will be

\[ L(u) = \min_{x \in \{0,1\}^{n}} \sum_{j=1}^{n} c_j(u)\, x_j + \sum_{i=1}^{m} u_i \]
Equation 4

where u_i is the Lagrangian multiplier for row i. c_j(u) is the
Lagrangian cost associated with column j, and is defined by

\[ c_j(u) = c_j - \sum_{i=1}^{m} a_{ij}\, u_i \]
Equation 5

where a_ij is the element of the binary matrix in row i and column j.



An optimal solution to Equation 4 is given by

\[ x_j(u) = \begin{cases} 0 & \text{if } c_j(u) > 0 \\ 1 & \text{if } c_j(u) < 0 \\ 0 \text{ or } 1 & \text{if } c_j(u) = 0 \end{cases} \]
Equation 6

L(u) can also be seen as an estimate of the lower bound for
the solution, i.e. the sum of the costs for the columns in the optimal solution
to the SCP will be ≥ L(u). The solution to the SCP can be found by finding
an optimal multiplier vector u* instead, but this will require much
computation, especially for a large SCP. Near-optimal multiplier vectors
can, however, be found within a short time by using the subgradient vector
s(u), defined by

\[ s_i(u) = 1 - \sum_{j=1}^{n} a_{ij}\, x_j(u), \qquad i = 1, \ldots, m \]
Equation 7

where a_ij is the element of the binary matrix in row i and column j.



u can be refined iteratively by using, for example,

\[ u_i^{k+1} = \max\left\{ u_i^{k} + \lambda\, \frac{UB - L(u^{k})}{\lVert s(u^{k}) \rVert^{2}}\, s_i(u^{k}),\; 0 \right\}, \qquad i = 1, \ldots, m \]
Equation 8

where λ > 0 is a step-size parameter and UB is an upper
bound on the value of the solution. The initial u^0 can be defined arbitrarily.
To solve the SCP, first a near-optimal multiplier vector u is found. This and
Equation 6 are then used as a basis to form a feasible solution. The upper
bound UB can then be updated to the value of this feasible solution (if it is
better than the previous best solution), and a new near-optimal multiplier
vector found, and so on until convergence is reached.

Another alternative computational method that may be
employed to solve such an SCP is 'surrogate relaxation', in which in each
iteration a corresponding continuous problem is solved and made feasible
before a subgradient algorithm is applied. Alternatively, genetic algorithms
may be employed in which the 'genome' consists of n bits, one bit for each
of the columns.
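
The subgradient refinement of the Lagrangian multipliers described above (Equations 4 to 8) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited report: the starting vector is simply zero, the step-size parameter `lam` and the iteration count are arbitrary illustrative values, the upper bound `ub` is supplied by the caller, and the function name `lagrangian_bound` is hypothetical.

```python
def lagrangian_bound(binary, costs, ub, lam=0.1, iterations=200):
    """Subgradient ascent on the multipliers u; returns (best L(u), u).

    binary -- m x n matrix of 0/1 entries a_ij (rows = pairs, columns = primers)
    costs  -- list of column costs c_j
    ub     -- upper bound UB on the value of the solution
    """
    m, n = len(binary), len(binary[0])
    u = [0.0] * m                       # arbitrary starting vector
    best_l, best_u = float("-inf"), list(u)
    for _ in range(iterations):
        # Lagrangian cost of each column: c_j minus the multipliers it covers.
        lag_cost = [costs[j] - sum(binary[i][j] * u[i] for i in range(m))
                    for j in range(n)]
        # A column is taken exactly when its Lagrangian cost is negative.
        x = [1 if c < 0 else 0 for c in lag_cost]
        # Value of the Lagrangian for the current multipliers.
        l_u = sum(c * xj for c, xj in zip(lag_cost, x)) + sum(u)
        if l_u > best_l:
            best_l, best_u = l_u, list(u)
        # Subgradient: 1 minus the number of selected columns covering row i.
        s = [1 - sum(binary[i][j] * x[j] for j in range(n)) for i in range(m)]
        norm2 = sum(si * si for si in s)
        if norm2 == 0:                  # every row is covered exactly once
            break
        # Step along the subgradient, keeping the multipliers non-negative.
        step = lam * (ub - l_u) / norm2
        u = [max(ui + step * si, 0.0) for ui, si in zip(u, s)]
    return best_l, best_u

example = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
bound, multipliers = lagrangian_bound(example, [1.0] * 4, ub=2.0)
```

For non-negative multipliers the returned value never exceeds the cost of an optimal cover, so it can be compared with the value of any feasible solution to judge how far that solution may be from the optimum.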

It should also be borne in mind that, as the SCP operates on
the binary matrix, which only represents differences in signals between
sequences for the same primer, a primer in the selected reduced set may
generate a negative signal, '-', rather than a positive signal, A, C, G or T.
To be sure that the sample does in fact contain a particular sequence it is
essential to ensure that for each sequence at least one primer generates a
positive signal. Furthermore, in practice redundancy is desirable as not all
reactions may occur as intended. Therefore, both the least number of
positive signals and the least number of differences in the signal
pattern are preferably larger than one.

With reference to Figure 3, the following is a description of 
one method of selecting a reduced set of primers. 

Firstly, all possible primers are selected (10) using the
standard APEX procedure to produce a first set of primers. During this
selection a substring of the sequence to be analysed is used to construct
one primer, then the substring is displaced by one base and another primer
is constructed. This process is carried out from the start of the sequence
until the entire sequence has been covered. Both strands of DNA are used
and this is repeated for all sequences. The primers should be long enough
to be capable of discriminating between exact matches and mismatches
involving one or two nucleotide pairs. Conveniently, the primers are 13 bp
long, as this has been found to be sufficient to ensure the reaction, or longer
to increase hybrid stability. However, to avoid steric hindrance on the chip
each primer may be 5'-tailed. In this example, twelve Ts are added to the
5'-end of the primer so that the final length of the primers is 25 bp.
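
The tiling step can be sketched in Python as below. This is a minimal sketch assuming sequences written over the letters A, C, G and T only; the function names `reverse_complement` and `tile_primers` are hypothetical, and the defaults (13-base window, twelve-T tail) simply follow the example values given above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def tile_primers(sequences, length=13, tail="T" * 12):
    """Slide a window of `length` bases, one base at a time, along both
    strands of every sequence and prepend the 5'-tail.

    Identical primers arising from different sequences are kept once,
    in order of first appearance.
    """
    seen = {}
    for seq in sequences:
        for strand in (seq, reverse_complement(seq)):
            for start in range(len(strand) - length + 1):
                seen.setdefault(tail + strand[start:start + length], None)
    return list(seen)

primers = tile_primers(["ACGTTGCAAGGCTAC"])
```

Because closely related alleles share most of their windows, collapsing duplicates at this stage already removes a large part of the redundancy before any set-covering analysis is run.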

Next, all primers that are not suitable as primers are rejected
(12) and the rest are included in a primary primer set. Unsuitable primers
are those where the three bases at the 3'-end are complementary to any
substring of the primer. In some instances this can result in the primer
being extended using a neighbouring primer, and not the sample DNA, as
a template, and for that reason such primers are considered unsuitable.
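
One possible reading of this rejection rule is sketched below in Python. It is an assumption that the check is applied to the gene-specific part of the primer only (without the poly-T tail), and that "complementary" means that the reverse complement of the 3'-terminal triplet occurs somewhere in the primer; the function name `may_self_extend` is hypothetical.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def may_self_extend(primer):
    """True if the three 3'-terminal bases can pair with a site in the primer.

    The triplet pairs with a site whose sequence, read 5' to 3', is the
    reverse complement of the triplet.
    """
    site = primer[-3:].translate(_COMP)[::-1]
    return site in primer

candidates = ["ACGTAACCGGTAC", "AAAAAAAAAAAAC"]
primary_set = [p for p in candidates if not may_self_extend(p)]
```

In this example the first candidate ends in TAC and contains GTA, so it is rejected, while the second is kept.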

Also, any primers that would produce ambiguous signals are 
identified and rejected (14). A primer produces an ambiguous signal where 
it is not known which of the four bases is in the relevant position. 

Each of the remaining primers in the primary primer set is
then compared to each sequence in turn to determine whether the primer is
extendible by each sequence and, if the primer is extendible, the base with
which it would be extended is determined. A signal matrix of the primers
with respect to each of the sequences is thus generated (16).

In order for a primer to be extended using the sample DNA as
template, the three bases at the 3'-end of the primer must hybridise to the
DNA. Otherwise the enzyme responsible for the extension will not be able
to add a nucleotide to the primer. In the rest of the primer (the poly-T tail
excluded), at most two mismatches are allowed; otherwise the primer-DNA
duplex is considered to be too unstable to be extended.

In ordinary PCR, all the bases must match in order for the
primer to be extended. In PCR, however, the temperature is raised to the
melting point, T_m, of the primer in the extension step. In APEX, this
reaction is carried out at 45°C, which is around 10-20°C below the T_m of
most primers. This means that the primers will hybridise to the DNA despite
a few mismatches, which is why two mismatches are allowed here.

In some cases a primer could hybridise to a sequence in
more than one position, and sometimes a primer could hybridise to both
strands of one allele and give different signals. In those cases all the
different signals are combined to form one resulting signal (e.g. 'A' and 'C'
together form 'M', which is the NC-IUB (NC-IUB, 1985) code for this
combination).
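
The rules above for filling one element of the signal matrix can be sketched in Python as follows. This is a minimal sketch assuming sequences over A, C, G and T only and a primer given without its poly-T tail; the function name `call_signal` is hypothetical. The signal is taken to be the base that follows the matched window on the same strand, which is the base added to the 3'-end of the primer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# NC-IUB codes for sets of bases (keys are the bases in alphabetical order).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def call_signal(primer, sequence, max_mismatches=2):
    """Signal of one primer against one sequence, or '-' if not extended."""
    rev = sequence.translate(COMPLEMENT)[::-1]
    k, bases = len(primer), set()
    for strand in (sequence, rev):
        # Stop one window early so that a following base always exists.
        for start in range(len(strand) - k):
            window = strand[start:start + k]
            if window[-3:] != primer[-3:]:      # 3'-triplet must match exactly
                continue
            mismatches = sum(a != b for a, b in zip(primer[:-3], window[:-3]))
            if mismatches <= max_mismatches:
                bases.add(strand[start + k])
    return IUPAC["".join(sorted(bases))] if bases else "-"

two_sites = call_signal("AAACCC", "AAACCCATTAAACCCC")
```

In the usage line the primer matches at two positions followed by 'A' and 'C' respectively, so the combined signal is 'M'.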

For each column of the signal matrix the entries for each row
are compared against one another; in other words, for each primer the
signals produced by the primer for each sequence are compared against
each other. A binary matrix is thus generated (18) of the primers with
respect to the identity or difference of signals for pairs of sequences. The
binary matrix contains non-zero entries where the primer is able to
distinguish between a pair of sequences.

The number of pairs of sequences that each primer can
distinguish between is counted and a score is allocated to each primer
(20) in dependence on the total number of pairs of sequences counted.
Thus, the number of non-zero elements for each primer is counted.
Primers that are unable to distinguish between any pairs of sequences are
rejected (22) and the remaining primers are sorted (24) in order of their
score, with the primers with the higher scores at the beginning.
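
The scoring, rejection and sorting steps (20) to (24) can be sketched in Python as below. The binary matrix used here is a hypothetical example and the function name `score_primers` is an assumption; ties are left in their original column order, whereas the description later breaks ties by the number of positive signals.

```python
def score_primers(binary):
    """Return (column index, score) pairs, highest score first.

    The score of a primer is the number of rows (pairs of sequences) in
    which its column holds a non-zero element; primers scoring zero are
    dropped.
    """
    n = len(binary[0])
    scores = [(j, sum(row[j] for row in binary)) for j in range(n)]
    kept = [(j, s) for j, s in scores if s > 0]
    return sorted(kept, key=lambda item: -item[1])   # stable sort keeps ties in order

ranked = score_primers([[1, 0, 0, 1], [1, 1, 0, 1], [0, 1, 0, 1]])
```

Here the fourth primer distinguishes all three pairs and is ranked first, while the third primer distinguishes none and is rejected.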

A core of primers is created next (26). The primer with the
highest score is selected. Where two primers with equal scores exist, the
number of positive signals is determined for each and the primer with the
greater number of positive signals is chosen. If both primers remain equal,
one is then selected arbitrarily over the other. After the main primer has
been selected, the first twenty (five times the desired redundancy, which is
four here) primers giving positive signals for each sequence in turn are
selected for the core. All remaining primers are rejected.

A greedy algorithm is then run (28) using the core set of
primers to identify the minimum number of primers necessary to distinguish
each sequence. As the greedy algorithm is run, primers are added one at
a time, with each primer being selected in turn in relation to the number of
uncovered rows it is capable of covering. When all rows are covered at
least four times, the reduced set of primers is checked for any sequence
that has fewer than four positive signals and extra primers are added as
necessary to meet this minimum requirement.
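
A greedy selection with a redundancy requirement can be sketched in Python as follows. This is a minimal sketch, not the exact procedure of Figure 3: it assumes unit cost for every primer unless `costs` is given, it uses the ratio of cost to newly covered rows as the selection term, and it omits the separate check on positive signals per sequence. The function name `greedy_cover` and the parameter `need` are hypothetical.

```python
def greedy_cover(binary, need=1, costs=None):
    """Add columns one at a time until every row is covered `need` times.

    Returns (chosen column indices, rows that could not be fully covered).
    """
    m, n = len(binary), len(binary[0])
    costs = costs or [1.0] * n
    missing = [need] * m              # covers still required for each row
    chosen = []
    while any(missing):
        best, best_ratio = None, None
        for j in range(n):
            if j in chosen:
                continue
            # Number of still-uncovered rows that column j would help cover.
            p_j = sum(1 for i in range(m) if missing[i] and binary[i][j])
            if p_j == 0:
                continue
            ratio = costs[j] / p_j
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = j, ratio
        if best is None:              # no remaining column covers anything
            break
        chosen.append(best)
        for i in range(m):
            if missing[i] and binary[i][best]:
                missing[i] -= 1
    return chosen, [i for i in range(m) if missing[i]]

example = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 1]]
single = greedy_cover(example)            # every pair distinguished once
double = greedy_cover(example, need=2)    # every pair distinguished twice
```

In the second call the first row can only ever be covered once, so it is reported as not fully covered; in practice this signals that the requested redundancy cannot be met for that pair of sequences.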

A redundancy check is then performed (30) to identify whether any more primers can be removed. During the redundancy check each primer is "tentatively" removed in turn to see whether the remaining primers meet the minimum requirements.

If not, the next primer is tried. Otherwise the primer is temporarily removed from the set, and the process continues with the next primer in line. This process continues until no more primers can be removed, in which case the last primer to be removed is added back to the set, and the next primer in line is tentatively removed, and so on. This can be viewed as a depth-first search of a tree where the nodes are combinations of primers, and the number of primers in each node is one less than in the node one level above. The root node thus contains all primers from the greedy algorithm. It has p (the number of primers after the greedy algorithm) primers in it. It also has p child nodes (because there are p ways in which one primer can be removed from a set of p primers), each with p-1 primers. Each of them has p-1 children with p-2 primers, and so on. In this way, all possible combinations of primers in the set fulfilling the requirements are found, and those combinations with the same, least number of primers are saved as the final primer sets.
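The depth-first redundancy check can be sketched as below. The `seen` set, which avoids revisiting a combination reached by removing the same primers in a different order, is an addition for the sketch and not something stated in the text. All names are hypothetical, and the search remains exponential in the worst case.

```python
def feasible(primers, cover, rows, need):
    """True if every row is covered at least `need` times by `primers`."""
    return all(sum(1 for p in primers if row in cover[p]) >= need for row in rows)


def minimal_subsets(primers, cover, rows, need=4):
    """Depth-first search over removals; returns all smallest feasible subsets."""
    rows = list(rows)
    start = frozenset(primers)
    if not feasible(start, cover, rows, need):
        return set()
    best = {"size": len(start), "sets": set()}
    seen = set()

    def dfs(current):
        if current in seen:          # same node reached via another removal order
            return
        seen.add(current)
        if len(current) < best["size"]:
            best["size"], best["sets"] = len(current), {current}
        elif len(current) == best["size"]:
            best["sets"].add(current)
        for primer in sorted(current):
            child = current - {primer}          # tentatively remove one primer
            if feasible(child, cover, rows, need):
                dfs(child)                      # removal holds; go one level down

    dfs(start)
    return best["sets"]
```

With `cover = {"a": {"r1", "r2"}, "b": {"r1"}, "c": {"r2"}}` and a required coverage of one, the only smallest set is `{"a"}`.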

Instead of applying the greedy algorithm to the core set, a modified algorithm called CFT may be applied.

Lagrangian subgradient

This algorithm consists of three main phases: a subgradient phase, where a near-optimal multiplier vector is found; a heuristic phase, where a solution to the SCP is found; and column fixing, designed to improve the results of the heuristic phase.

In the subgradient phase, a near-optimal multiplier vector u is found using Equation 8. At the beginning, the starting vector u^0 used is defined as

$$u_i^0 = \min_{j \in J_i} \frac{c_j}{|I_j|}$$
Equation 9
Later calls use the last vector u before column fixing, and apply a small perturbation before using it as the starting vector. The perturbation is randomly (and uniformly) distributed in the range ±10% for each element. The sequence of multiplier vectors is considered to have converged when the improvement in L(u) in the last 50 iterations is smaller than 0.1%, or when the number of iterations has reached 10 × m. The factor λ in Equation 8 was set to 0.1 at the beginning, and was updated as follows: every 20 iterations, the best and worst lower bounds L(u) during those 20 iterations are compared to each other. If the difference is larger than 1%, the value of λ is halved. If the difference is less than 0.1%, λ is multiplied by 1.5. In the first call, the upper bound, UB, used is the sum of the costs of the first primers that together cover all rows four times. Otherwise it is the value of the best solution found so far.
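The step-factor update described above can be sketched as follows. The text does not say what the percentage difference is measured against; taking it relative to the best bound in the window is an assumption of this sketch, as are the function and parameter names.

```python
def update_step_factor(factor, recent_bounds):
    """Adjust the step factor from the lower bounds of the last 20 iterations.

    factor:        current value of the step factor (0.1 at the start).
    recent_bounds: lower bounds L(u) observed in the last window.
    """
    best, worst = max(recent_bounds), min(recent_bounds)
    if best == 0:
        return factor                       # relative gap undefined; leave as is
    gap = (best - worst) / abs(best)        # assumed: gap relative to best bound
    if gap > 0.01:                          # bounds oscillate: take smaller steps
        return factor / 2
    if gap < 0.001:                         # bounds barely move: take larger steps
        return factor * 1.5
    return factor
```

A window with bounds 100 and 98 halves the factor, while a window with bounds 100 and 99.99 increases it by half.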

In the heuristic phase, the last vector from the subgradient phase is used to generate a sequence of multiplier vectors (again using Equation 8), and a feasible solution is constructed for each of the multiplier vectors. The procedure used to generate a feasible solution is a variation of the greedy algorithm, where each column is scored according to

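$$\gamma_j = c_j - \sum_{i \in R} u_i a_{ij}, \qquad \mu_j = \sum_{i \in R} a_{ij}, \qquad \sigma_j = \begin{cases} \gamma_j / \mu_j & \text{if } \gamma_j > 0 \\ \gamma_j \, \mu_j & \text{if } \gamma_j < 0 \end{cases}$$
Equation 10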

where R is the set of uncovered rows in each step. The column with the lowest score, i.e. the column with the best "gain/cost" ratio, is added in each step to the solution. This continues until no improvements to the best solution (i.e. minimum number of primers) have been made for 50 iterations.
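One reading of the column score is sketched below: the cost of the column minus the multipliers of the uncovered rows it covers, divided by the number of such rows when positive and multiplied by it otherwise. This reading follows the usual form of such scores and is an assumption, since the printed formula is not fully legible. The function name and arguments are hypothetical.

```python
def column_score(cost, column_rows, uncovered, multipliers):
    """Score a column for the greedy step of the heuristic phase.

    cost:        cost of the column (all costs are equal in this application).
    column_rows: set of rows covered by the column.
    uncovered:   set R of rows not yet sufficiently covered.
    multipliers: dict row -> Lagrangian multiplier u_i.
    Lower scores are better.
    """
    hit = column_rows & uncovered
    count = len(hit)
    if count == 0:
        return float("inf")                 # column covers nothing new
    reduced = cost - sum(multipliers[row] for row in hit)
    # Positive reduced cost: spread it over the rows gained.
    # Otherwise (zero or negative): reward covering many rows.
    return reduced / count if reduced > 0 else reduced * count
```

The column with the lowest returned value would be added next.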
After the heuristic phase, column fixing is applied to the solution. Columns that are absolutely necessary in order for a row to be covered (i.e. if there are only e columns covering a row and each row is to be covered e times) are fixed. These fixed columns are then used as a starting point for the greedy algorithm, and the first max{⌊200/m⌋, 1} columns chosen therein are fixed as well.

These three phases are then applied again to the problem, with the condition that the fixed columns must be included in the solution this time. Columns already fixed in a previous round cannot be removed from the solution. This goes on until either all rows are covered by the fixed columns, or the cost of the fixed columns is larger than the estimated lower bound for the entire problem, or no new columns were fixed in the last iteration.

When the three phases are done, the problem is refined, in order to improve the solution. Here, each column in the best solution found so far is scored according to

$$\delta_j = \max\{c_j(u), 0\} + \sum_{i \in I_j} u_i \frac{K_i - 1}{K_i}$$
Equation 11

where

$$K_i = |J_i \cap S|$$
Equation 12

and S is the set of columns in the solution. The term u_i(K_i - 1) is the contribution of row i to the gap between the estimated lower and upper bound of the problem. This is then split uniformly between all columns in the solution covering that row. Columns with small δ_j (contributing the least to the gap) are then likely to be part of the optimal solution. The p columns with the smallest δ_j are then fixed before the entire algorithm is applied again to the resulting sub-problem. (Column fixing here has nothing to do with column fixing after the heuristic phase, so columns fixed there need no longer be fixed here.) p is the smallest value satisfying




$$\sum_{k=1}^{p} |I_{j_k}| \ge \pi \, e \, m$$
Equation 13

where {j_k} is the set of columns in the solution ordered with ascending δ_j, and I_j is the set of rows covered by column j. π is in the range 0...1 and controls the percentage of rows removed after fixing. π = 1 means that no rows will be uncovered, while π = 0 means that no columns will be fixed before reapplying the algorithm. (Since each row has to be covered multiple times, in this case it is not actually the number of rows but the number of elements covering the rows that is regulated by π.) In the beginning, π is set to 0.3 and is multiplied by α = 1.1 if the best solution so far was not improved in the last application of the three main phases. If a better solution was found, it was reset to 0.3. Because of the density of the matrices, the number of columns fixed in this step was also set to be at least one more than in the previous iteration (if no improvements were made). Otherwise the same number of columns would be fixed in a number of iterations before the value of π is large enough to allow more columns to be fixed.
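The refinement control can be sketched as below. Two points are assumptions of the sketch rather than statements from the text: the threshold is taken as the fraction times the required coverage e times the number of rows m (the printed formula is only partly legible), and the fraction is capped at 1. The rule that at least one more column is fixed than in the previous iteration is not modelled. All names are hypothetical.

```python
def columns_to_fix(solution, delta, rows_of, fraction, e, m):
    """Choose the columns to fix in the refining step.

    solution: columns in the best solution found so far.
    delta:    dict column -> refinement score (smaller = more likely optimal).
    rows_of:  dict column -> set of rows covered by that column.
    fraction: control parameter in the range 0..1.
    e, m:     required coverage per row and number of rows.
    """
    target = fraction * e * m               # assumed form of the threshold
    if target <= 0:
        return []                           # fraction 0: fix nothing
    ordered = sorted(solution, key=lambda column: delta[column])
    covered_elements = 0
    for count, column in enumerate(ordered, start=1):
        covered_elements += len(rows_of[column])
        if covered_elements >= target:
            return ordered[:count]          # smallest prefix reaching the target
    return ordered


def next_fraction(fraction, improved, start=0.3, growth=1.1):
    """Reset to the start value after an improvement, otherwise grow by 10%."""
    return start if improved else min(1.0, fraction * growth)
```

Starting at 0.3, three consecutive rounds without improvement give roughly 0.33, 0.363 and 0.399.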

The algorithm is iterated until either the value of the best solution is less than the estimated lower bound, all columns in the best solution found so far are already fixed in the refining step, or a time limit is exceeded. The time limit in this case was arbitrarily set to as many seconds as there were rows in the problem. However, the time limit is only checked before the refining step. If it is not exceeded, a whole iteration of the algorithm will be executed before another check is done. Here too a check was done afterwards to see if primers could be removed without breaking any constraints.

With this algorithm no pricing is performed. Pricing is used to update the core problem, exchanging columns between the core problem and columns outside the core. It was not included here since it was argued that, since the costs of the columns are all the same, the best columns would be those with the largest number of non-zero elements. These would be the first columns to be added to the core, and the columns not included in the core would most probably not be better than those included. Also, the pricing step would require some computation, which would extend the time required by this algorithm. As is, the computational requirement of this algorithm is several orders of magnitude higher than for the greedy algorithm. Finally, the main memory available in the computer puts a limit on how large the problems can be. If pricing were included, not all data would fit into the physical memory, forcing the computer to use a swap file, which would increase the computation times considerably.

Using both alternative algorithms described above, a minimum number of primers was identified for various sequences. The results are set out below.

It will be apparent that the initial manual rejection of primers, steps (12, 14 and 22), need not be performed and instead the algorithms can be applied to the original complete set of primers. However, the initial rejection of obvious failed primer candidates can significantly reduce the computational time required in the later stages. Similarly, in many cases the final redundancy check (30) need not be performed, as in many cases little or no reduction in the number of primers was achieved by this final check.

Furthermore, although in the method described above the primers were initially sorted in order of score, this need not be performed. The algorithms for stripping out redundant primers are capable of operating with any order of primers including a wholly random order. However, slightly better results were obtained when ordering by score was performed.

Collecting sequences 

The HLA sequences were available internally from Amersham Pharmacia Biotech (release December 1997), and included 91 alleles from HLA-A, 202 HLA-B, 47 HLA-C, 11 HLA-DPA1 (coding for the α-chain), 74 HLA-DPB1 (β-chain), 18 HLA-DQA1, 34 HLA-DQB1, 192 HLA-DRB1 and 35 sequences in all of HLA-DRB3, -DRB4 and -DRB5. The lengths of these sequences range from ~250 bp to ~1100 bp.

The 16S rRNA sequences were collected from GenBank (Benson et al., 1998), an annotated database of all publicly available DNA sequences. Only a subset of all the available 16S rRNA sequences was used. The sequences used were all from organisms that could be identified using either the MicroLog or the MicroStation system from Biolog Inc., or the API systems from Counterpart Diagnostics. These systems utilise differences in metabolism in order to identify the organisms, which is the most common way of identifying micro-organisms today. Altogether, 1207 sequences from 523 different organisms were collected from GenBank. 269 of those 523 organisms had only one 16S rRNA sequence among those 1207 sequences. The length of these sequences is between ~1000 bp and ~1500 bp.



Data set    No. sequences    Mean length of sequences
DPA1        11               517
DPB1        74               288
DQA1        17               616
DQB1        34               490
DRB1        192              324
DRB345      35               400
HLA-A       91               944
HLA-B       200              900
HLA-C       47               1003
16S rRNA    1207             1452

Table 1: Details about data sets.

The program was written using the Microsoft Visual C++ version 5.0 compiler. It was executed on a PC with a Pentium® MMX 233 MHz processor, 64 MB RAM and Windows® 95, unless otherwise indicated. All execution times are for the entire program, including I/O.

As can be seen in Table 2, the binary SCP matrices were quite dense. The density (i.e. the proportion of non-zero elements in the matrix) usually lies around a few percent, of course depending on the application. A higher density means that fewer columns are needed in order to cover all rows. This is offset in this case by the fact that all rows were required to be covered multiple times. Another consequence of this high density is that the number of primers needed according to the greedy algorithm could be much higher than in the optimal solution. (Recall that the worst case behaviour of the greedy algorithm is a function of the largest column-sum of elements.)


Dataset      DPA1   DPB1   DQA1   DQB1   DRB1   DRB345  HLA-A  HLA-B  HLA-C  16S rRNA
No. rows     55     2701   136    561    18336  595     4095   19900  1081   727821
Density (%)  47.89  20.73  36.31  42.18  24.98  37.70   36.31  32.33  30.41  2.04

Table 2: Some details about the binary SCP matrix. Data are calculated for all primers in the primary set.

The program could be considered as consisting of two phases. The first phase involves constructing all primers and finding out what kind of signal they will get for each sequence. The second phase is the optimisation phase, where the SCP is solved. Some details about the first phase can be found in Table 3.



Dataset      DPA1   DPB1   DQA1   DQB1   DRB1   DRB345  HLA-A   HLA-B   HLA-C  16S rRNA
First set    1747   1885   2487   2891   3891   3031    4756    4994    4293   247877
Primary set  1333   1475   2166   2730   3651   3016    3886    4585    3354   247877
Core set     106    321    213    244    385    203     595     750     338    2377
Time (s)     4.67   6.81   11.26  18.51  42.29  14.56   124.74  286.82  61.29  150632

Table 3: Number of primers in different stages of the algorithm and time to get signals for all primers. The numbers of primers in the core set are for homozygotes.

One explanation for this high density is that the sequences in the data sets are quite similar to each other, so that most primers will hybridise to and give signal for more than one sequence (either the same or different signals). This is also indicated in Table 3, where for some data sets there is a noticeable drop from the number of primers in the first set to the number of primers in the primary set. Most of this reduction is due to a primer having the same signal for all sequences, which in turn means that all sequences have a substring that is similar enough for the primer to hybridise to and that the nucleotide after the primer is the same for all sequences. In contrast, the 16S rRNA data set has a much lower density, and no reduction in the primers going from the first set of primers to the primary set. As the sequences in this data set come from organisms which might be only distantly related to each other, there need not be as much similarity between the sequences as there is in the HLA data sets. Another explanation is this: if all k sequences except one give the same signal for a primer, that column in the binary SCP matrix will have k-1 non-zero elements. The density (for that column) will then be (k-1) / (k(k-1)/2) = 2/k. In other words, the density will be higher for smaller values of k, and smaller for larger values. This means that it would be "natural" for smaller matrices to have higher densities, and larger matrices to have lower densities.
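The arithmetic can be checked with a short sketch. The sequence counts are those of Table 1, and the row counts of Table 2 are consistent with one row per unordered pair of sequences, k(k-1)/2. The variable and function names are chosen for the example only.

```python
from math import comb

# Number of sequences per data set, as listed in Table 1.
sequences = {
    "DPA1": 11, "DPB1": 74, "DQA1": 17, "DQB1": 34, "DRB1": 192,
    "DRB345": 35, "HLA-A": 91, "HLA-B": 200, "HLA-C": 47, "16S rRNA": 1207,
}

# One row per unordered pair of sequences: k(k-1)/2.
rows = {name: comb(k, 2) for name, k in sequences.items()}


def odd_one_out_density(k):
    """Column density when exactly one of k sequences gives a different signal."""
    return (k - 1) / comb(k, 2)          # equals 2/k


for name, k in sequences.items():
    print(f"{name:9s} rows={rows[name]:7d} density={100 * odd_one_out_density(k):6.2f}%")
```

For such a column the density falls from about 18% for the 11 DPA1 sequences to under 0.2% for the 1207 16S rRNA sequences, which illustrates why the larger matrix is naturally sparser.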

In the second phase, solving the SCP, a few different approaches were tried. The results, the minimum number of primers needed and the time required to find this number, can be found in Table 4 and Table 5. Even though the worst case behaviour of the greedy algorithm is not so good in this application, the results are not much worse than when using a Lagrangian subgradient (CFT) method. The greedy algorithm typically needs two or three more primers, while the computation times are much lower for the greedy algorithm.

The results show that it is worthwhile to check the results from the greedy algorithm for redundancy. In all cases except one, primers could be removed and the resulting primer sets still fulfil all requirements. This is not true for the CFT algorithm, however, as there is only one instance in which the result could be improved. On the other hand, since there is some randomness in the CFT algorithm (an old multiplier vector is disturbed randomly before being used as a starting vector in the next iteration), the results can differ from one execution of the algorithm to another. Sometimes the results can be improved, and sometimes not (results not shown).

Dataset    DPA1  DPB1  DQA1  DQB1  DRB1  DRB345  HLA-A  HLA-B  HLA-C  16S rRNA
Greedy     11    42    32    31    48    24      73     103    51     210
Time (s)   0.27  1.37  0.61  0.71  11.5  0.66    4.61   31.36  1.15   9921.48*
Final      11    41    30    29    44    21      72     99     47     197^
Total (s)  0.27  1.81  0.72  0.88  30.3  0.71    6.48   85.14  1.76   >300000^

Table 4: No. of primers after the greedy algorithm and time spent by it. Also the final no. of primers after the check for redundancy and the total time spent solving the SCP. *Value from a 300MHz Pentium II with 512MB RAM running Windows NT 4.0. ^The computation was halted before completion due to time constraints.



Dataset    DPA1   DPB1     DQA1   DQB1    DRB345  HLA-A    HLA-C
CFT        10     38       26     27      20      69       47
Time (s)   10.22  2748.92  60.80  372.56  427.32  4547.33  1091.37
Final      10     38       26     27      20      69       45
Total (s)  10.22  2749.14  60.86  372.61  427.38  4548.49  1111.70

Table 5: Results using modified algorithm CFT.

One reason CFT is not much better than the greedy algorithm could be that it was designed for other instances of SCP. The SCP instances arising in this application differ in three aspects from those: A) the density is much higher, B) all rows are to be covered multiple times and C) the costs of all columns are the same.

A comparison was made between the results from the greedy algorithm and from CFT in Table 6. Most of the primers (70% or more) were chosen by both algorithms, indicating that these primers are likely to be part of an optimal solution. However, this is only an indication, as the only way to prove this is to find an optimal solution. This will require far too much time even for the smallest data set, as the problem is NP-hard.



Dataset      DPA1   DPB1   DQA1   DQB1   DRB345  HLA-A  HLA-C
Greedy       11     41     30     29     21      72     47
CFT          10     38     26     27     20      69     48
Same         7      33     22     22     14      62     38
Percent (%)  70.00  86.84  84.62  81.48  70.00   89.86  80.85

Table 6: Comparison of primers from the two different algorithms.

Results from combining HLA sequences in order to differentiate between heterozygous individuals can be found in Table 7. CFT was only used for the two smallest data sets due to the time requirements. It performed slightly better than the greedy algorithm on those, but only by one primer on each data set. There are heterozygotes that cannot be distinguished from another heterozygote, which can be seen in Table 7. This happens because the combination of two sequences to form one heterozygote could result in exactly the same signal pattern as another combination of homozygotes. In other words, some rows in the signal matrix will be the same, leading to some rows in the binary SCP matrix not containing any non-zero elements at all. For some of those pairs listed, this is not true, however. They are listed because there were not enough primers that have different signals for these pairs, and so could not meet the requirement of at least four different signals in the signal patterns (Table 8). For the rest, it is simply a limitation of this technique to type HLA genes. To be able to identify the alleles forming each heterozygote, primers that amplify alleles selectively should be used in the PCR step. This will remove the ambiguities, as some heterozygotes will simply be transformed to homozygotes since only one of the alleles in the heterozygote will be amplified and not the other.
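How two different allele combinations can end up with the same signal pattern can be illustrated with a small sketch. It assumes that the signal a heterozygote gives for a primer is simply the union of the signals of its two alleles; that model, the toy alleles and all names below are assumptions for illustration, not data from this work.

```python
from itertools import combinations, combinations_with_replacement


def genotype_pattern(allele_a, allele_b, signals):
    """Per-primer signal of a genotype, assumed to be the union of both alleles."""
    return tuple(
        frozenset(first | second)
        for first, second in zip(signals[allele_a], signals[allele_b])
    )


def ambiguous_genotypes(signals, need=4):
    """List genotype pairs whose patterns differ in fewer than `need` primers."""
    genotypes = list(combinations_with_replacement(sorted(signals), 2))
    patterns = {g: genotype_pattern(g[0], g[1], signals) for g in genotypes}
    clashes = []
    for first, second in combinations(genotypes, 2):
        differing = sum(1 for x, y in zip(patterns[first], patterns[second]) if x != y)
        if differing < need:
            clashes.append((first, second, differing))
    return clashes


# Toy data: four hypothetical alleles, two primers, one extended base each.
toy_signals = {
    "A1": [{"A"}, {"C"}],
    "A2": [{"G"}, {"T"}],
    "A3": [{"A"}, {"T"}],
    "A4": [{"G"}, {"C"}],
}
```

In the toy data the genotypes A1/A2 and A3/A4 both give the signals {A, G} and {C, T}, so `ambiguous_genotypes(toy_signals, need=1)` reports exactly that pair with zero differing primers.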
Dataset      DPA1     DPB1     DQA1     DQB1    DRB345  HLA-A      HLA-C
Greedy       26       130      51       81      94      172        94
Time (s)     0.99     9229.57  7.41     294.51  453.19  20826.20*  1212.59
CFT          25       -        50       -       -       -          -
Time (s)     1943.82  -        8427.82  -       -       -          -
Amb. het     0        16       2        2       6       19         4
Percent (%)  0.00     0.58     1.31     0.34    0.95    0.45       0.35

Table 7: Results from heterozygous pairs. Number of primers needed, the time spent, how many heterozygotes did not differ by at least four signals from any other heterozygote and the percentage of the total number of heterozygotes. *Value from a 300MHz Pentium II with 512MB RAM running Windows NT 4.0.

Unfortunately, it was not possible to obtain any results for heterozygotes for the data sets DRB1 and HLA-B, as these were too large to run on existing machines. A very approximate extrapolation of the primers needed for these data sets suggests that the total number of primers for all HLA sets together would be <1000, which can be placed on one chip without problem (one chip can contain up to ~5000 primers). Without the reduction obtained above, at most two genes could be tested on each chip. With the reduction, all nine HLA genes and the 16S rRNA gene can be tested on one chip, with plenty of room to spare for other genes as well. This makes APEX more versatile, as it allows a family of related genes to be tested using only one chip instead of several.






DPB1

Pair 1: DPB1*0501 DPB1*2101
Pair 2: DPB1*2201 DPB1*3601
No. diff.: 2

Pair 1: DPB1*0501 DPB1*5501
Pair 2: DPB1*3001 DPB1*6301
No. diff.: 2

Pair 1: DPB1*0601 DPB1*3601
Pair 2: DPB1*20011 DPB1*2101
No. diff.: 1

Pair 1: DPB1*0801 DPB1*1401
Pair 2: DPB1*1001 DPB1*5701
No. diff.: 0

Pair 1: DPB1*0901 DPB1*3001
Pair 2: DPB1*1701 DPB1*5401
No. diff.: 0

Pair 1: DPB1*0901 DPB1*3601
Pair 2: DPB1*2101 DPB1*3501
No. diff.: 0

Pair 1: DPB1*0901 DPB1*4501
Pair 2: DPB1*1001 DPB1*1401
No. diff.: 0

Pair 1: DPB1*3901 DPB1*5301
Pair 2: DPB1*4001 DPB1*4901
No. diff.: 0

DQA1

Pair 1: DQA1*0101 DQA1*0104
Pair 2: DQA1*0101 DQA1*0105
No. diff.: 3

DQB1

Pair 1: DQB1*0604 DQB1*0612
Pair 2: DQB1*0808 DQB1*0609
No. diff.: [illegible]

DRB345

Pair 1: DRB4*01011 DRB4*01011
Pair 2: DRB4*01011 DRB4*0301N
No. diff.: 0

Pair 1: DRB4*01011 DRB4*0103
Pair 2: DRB4*0103 DRB4*0301N
No. diff.: 0

Pair 1: DRB4*0201N DRB4*0201N
Pair 2: DRB4*0201N DRB4*0301N
No. diff.: 0

HLA-C

Pair 1: Cw*12042 Cw*1502
Pair 2: Cw*1205 Cw*1503
No. diff.: 0

Pair 1: Cw*1203 Cw*1602
Pair 2: Cw*12042 Cw*1601
No. diff.: [missing]

HLA-A

Pair 1: A*0101 A*2411N
Pair 2: A*0104N A*2402
No. diff.: 0

Pair 1: A*0201 A*0205
Pair 2: A*0202 A*0206
No. diff.: 1

Pair 1: A*0201 A*0205
Pair 2: A*0214 A*0222
No. diff.: 1

Pair 1: A*0201 A*0208
Pair 2: A*0205 A*0220
No. diff.: 0

Pair 1: A*0201 A*0213
Pair 2: A*0212 A*0226
No. diff.: 2

Pair 1: A*0201 A*2408
Pair 2: A*0222 A*2413
No. diff.: 0

Pair 1: A*0202 A*0206
Pair 2: A*0214 A*0222
No. diff.: 0

Pair 1: A*0212 A*2801
Pair 2: A*0222 A*2808
No. diff.: 2

Pair 1: A*2402 A*2502
Pair 2: A*2407 A*2501
No. diff.: 0

Pair 1: A*2402 A*68012
Pair 2: A*2407 A*68031
No. diff.: 0

Pair 1: A*2501 A*68012
Pair 2: A*2502 A*68031
No. diff.: 2



Table 8: Heterozygous pairs that do not differ enough in their signal 
patterns, and how many signals they differ with. 

The results of this work are summarised in the following Table 9.






Class I  Number of alleles  Primers needed    Class II  Number of alleles  Primers needed
HLA-A    91                 172               DPA1      11                 26
HLA-B    -                  -                 DPB1      74                 130
HLA-C    47                 94                DQA1      17                 51
                                              DQB1      34                 84
                                              DRB1      192                <1000
                                              DRB345    35                 94

Table 9. Number of primers needed to discriminate between heterozygote HLA samples.

Some sets of primers indicated in Table 9, and also the set indicated for 16S rRNA, are set out in appendix 2.

Primers can be arranged on the surface of a support in such a way that different studied types, genes, alleles, species etc. form easily recognised characters such as figures or letters. These character-forming primers can be additional primers of common origin from the gene of interest and be used for validation of the process.

The following demonstration is based on the HLA Class II DQB gene.

Experimental 

Materials 

Amplification:

DNA: Four cell lines homozygous for DQB, with alleles 0402, 0301, 06011 and 0201.

Primers: Primer DQB 9246 from Williams et al. -96 and DQB 96012 from the Amersham Pharmacia Biotech HLA DQB typing kit, covering exon 2, generating a fragment of 300 base pairs.

Amplification reagents: PCR mix from the Amersham Pharmacia Biotech HLA DQB typing kit, a prototype kit.

All amplifications were spiked with dUTP, to get a final concentration of 100 or 200 mM dUTP.

Enzymes for fragmentation of PCR products:
Shrimp alkaline phosphatase (SAP), 1 U/µl, APB.
Uracil-DNA-glycosylase (if from PE, UDG = UNG), 1 U/µl, NE Biolabs.

SAP will degrade (dephosphorylate) all free dNTPs and UDG will remove all dU from the DNA, and after heating the strands will be broken at these points. This step is applicable to any DNA fragment.

Primers for spotting:

All 84 primers for the 500 bp fragment were ordered from LTI/GIBCO BRL Custom primers service. All were 25-mers with an amino-activated 5'-end. For primer sequences see appendix 1. Self-extended primers were N, A, C, G and T as controls with the following sequences:

N: amino TTT AGC CTT AAC GCC T N TGAC GTCA

A, C, G, T: amino TTT AGC CTT AAC GCC T X TGAC GTCA, where X is A, C, G or T.

Extension reagents for the APEX reaction

Dyes: Specially synthesised for Baylor by Du Pont and/or APB
Cy2 - ddCTP (equal to fluorescein) 50 µM
Cy3 - ddATP 50 µM
Texas Red - ddGTP 50 µM
Cy5 - ddUTP (often written as T in many of the reactions and results) 50 µM

10x ThermoSequenase™ DNA polymerase buffer (TS):
260 mM Tris-HCl pH 9.5; 65 mM MgCl2. ThermoSequenase DNA polymerase (Amersham Pharmacia Biotech) 4 U/µl; if needed, dilute with T.S. dilution buffer (= 10 mM Tris-HCl pH 8.0; 1 mM β-mercaptoethanol, 0.5% Tween-20 (v/v), 0.5% Nonidet P-40 (v/v)). TS was used from a 150 unit stock and diluted 1 µl + 37 µl dilution buffer.

Methods 

Preparation of glass slides before spotting of primer:

Arrange 25-30 cover slips (24 x 60 mm) in a stainless staining tray.
Immerse the tray in a glass staining dish with acetone so that the slides are fully immersed.
Place the glass staining dish in a sonicator for 10 minutes.
Remove the tray from the acetone bath, shake off excess acetone and rinse several times (at least twice) in MilliQ water.
Immerse the tray in 100 mM NaOH and sonicate for 10 minutes (a few more minutes is no problem).
Remove the tray, shake off excess NaOH and rinse several times (at least twice) in MilliQ water.
Immerse the tray in silane solution and sonicate for 2 minutes.
Wash the slides by immersion in 100% EtOH once.
Dry the tray with the slides using nitrogen at a high velocity (without breaking the slides).
Cure the slides in a vacuum oven at 100°C overnight or until they are used for spotting (at least 20 minutes of vacuum is needed).

Spotting of oligos:

All spotting was done with a spotter with 96 parallel capacity. Each slide was spotted with three replicas of the primers. After spotting, the slides were allowed to air dry for 5 to 15 minutes; when dried they were marked. They were stored at room temperature, in a dry place, in the trays until used.

DQB amplification

The DQB amplification was done according to the method described by Williams et al. -96 using a 33% dUTP mix. After 40 cycles (95°C, 30 sec; 55°C, 30 sec; 72°C, 30 sec), one microliter of the PCR products was tested on a 1.5% agarose gel, before the fragmentation step.

Williams, Bassinger, Moehlenkamp, Wu, Montoya, Griffith, McAuley, Goldman, Maurer: Strategy for distinguishing a new DQB1 allele (DQB1*0611) from the closely related DQB1*0602 allele. Tissue Antigens, 1996, 48:143-147.

Fragmentation of PCR products:

Before APEX can be done, all DNA fragments must be fragmented so that all new fragments can get access to the primers on the chip.

Setup:
5 µl DNA from a PCR reaction (1/10 of the PCR reaction)
2 µl SAP (Shrimp alkaline phosphatase), 1 U/µl, APB
1 µl UDG (Uracil-DNA-glycosylase), 1 U/µl, NE Biolabs
15 µl water
Total: 23 µl

Incubate at 37°C for 2 hours.

The samples were frozen and stored until they were used.

Inactivation of enzymes at 100°C for 10 minutes can be done, but is not needed since this is the first step in the APEX reaction.

Extension method for the APEX reaction

Slide treatment:

Start by washing the slides in hot water (90 - 98°C, not boiling) for 2 x 5 minutes in a 50 ml Falcon tube. When the slides are ready, remove them from the tube with forceps and place them on a dry heater block at 48°C. The slide (= DNA chip) is now ready for adding the reactions.

APEX reactions set up:

23 µl DNA from the fragmentation step.
3 µl 10x TS reaction buffer (the rest of the buffer comes from PCR and UDG cleavage).
17 µl for cover slip method.
Heat denature at 100°C for 7-10 minutes, target 8 minutes, not longer.
Spin the tube quickly and quickly add
1 µl ThermoSequenase DNA polymerase (4 U)
1 µl Dye-mix (50 µM of the four dideoxynucleotides A, C, G, and T, separately dye labelled).

Then the reaction mix was physically spread out over the primer array with the tip of a pipette. Incubate at 48°C until no trace of solution is seen. This takes about 8 minutes.

Wash with hot water for 2 - 5 minutes, 2 times. Ready to read on detection instrument.

Detection

The detection system is a total internal reflection fluorescence (TIRF) system, where microscope slides are placed on top of a prism, with oil used to couple a laser beam into the glass slide. The system has light of five different wavelengths from five different lasers to vary between. In this experiment only four were used. To detect Cy2 a 488 nm laser was used, for Cy3 a 532 nm laser, for Cy5 a 635 nm laser and for Texas Red a 670 nm laser. The image-related software was based on Image Pro Plus 3.0.

Results

Amplification of HLA DQB alleles

The DNA from the four DQB homozygote cell lines was amplified according to the protocol in Williams et al. -96 with two different concentrations of dUTP. In addition to this, DNA from six different heterozygotes was amplified. All amplifications worked well and the expected 300 bp fragment was seen from all samples.

APEX reaction with DQB chip

Primer chips were washed and fragmented PCR products were incubated on the chip according to the protocol. The image was compared to the expected pattern. The expected pattern was similar to but somewhat different from the recorded pattern; the reason for this is that the setup was planned for a 500 bp fragment, but the actual fragment used was a 300 bp PCR fragment.

Homozygous cell lines results

Figure 4 shows the results from a cell line homozygous for the DQB 0204 allele. The pattern shown in the image is very close or similar to the expected results from exon 2.

In all reactions the control primers worked well and the four dyes were used in the same frequencies. In the case with a 500 bp fragment for DQB typing, the primers for allele 0402 were placed in such a way that they formed figures. In Figure 4, panel D, most signals are seen forming a "2" from the 300 bp fragment, and the missing signal will be seen when the large PCR fragment is used. This clearly shows that primers can be placed in a clever way to form figures.

Heterozygous results

For the heterozygous test only one of the four dye reactions worked. Some of the expected spots from the heterozygous sample were not seen, but this is probably due to the fact that no control signals were seen in the lower right hand corner, where the signals were weaker than in other parts of the slide.

As this experiment shows, a limited number of primers can be used for HLA typing, and if they are placed in a clever way the interpretation of the results is very simple. Both homozygous and heterozygous samples can be correctly analysed with this method.

Continuation 

An algorithm was developed in order to select the minimum
number of primers needed to identify different genes using APEX. It was
applied to the following HLA genes: HLA-A, HLA-B, HLA-C, HLA-DPA1,
HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1 and HLA-DRB345. It was
also applied to the 16S rRNA gene. In the case of HLA-DQB1, the primers
have been shown to work as intended. At present, a few assumptions have
been made (such as how many mismatches to allow between the primers
and the sample DNA) that need to be tested and possibly refined.

Another improvement that can be made is the following. At present,
the program works only with discrete signals, i.e. either there is a signal 'A'
or there is not, either there is a signal 'G' or there is not, and so on. A more
precise approach would be to predict how strong the signals will be for
each primer on each sequence. A rough estimate of the signal strength
should be possible given some thermodynamic data about the primers,
most notably their melting points. With this information, and knowing the
concentration of DNA in the sample among other things, it should be
possible to estimate the proportion of primers on the chip that will actually
react with the sample DNA. This would allow a rough estimation of the
strength of the different signals. It will not be very precise, and the
estimate might possibly be off by a factor of 2 or more, but it will still give
some information about what signals to expect from the chip.

Given the melting points of the primers, the temperature at
which the reaction on the chip is carried out could be optimised as well.
Since the sequences are known, it is possible to estimate the melting point
of any primer to any sequence when there are a few mismatches. This
could be done for all primers on all sequences, and a range of
temperatures calculated. The actual temperature to use could then be
chosen so as to be as close to optimal as possible for as many primers on
as many sequences as possible, instead of using a standard temperature
as is done now.
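The temperature selection outlined above could be sketched as follows. This is only an illustration under stated assumptions: the melting points are taken as already estimated for every primer/sequence pair, a pair is counted as well served when the reaction temperature lies at most `window` degrees below its melting point, and the function name, the window of 5 degrees and the example values are hypothetical.

```python
def best_temperature(melting_points, window=5.0):
    """Pick the reaction temperature that suits the most primer/sequence pairs.

    melting_points: estimated melting points (deg C), one per pair.
    A pair counts as suited when Tm - window <= temperature <= Tm
    (an assumed criterion, not taken from the description).
    Returns (temperature, number_of_suited_pairs).
    """
    tms = list(melting_points)
    if not tms:
        raise ValueError("no melting points given")

    def suited(t):
        return sum(1 for tm in tms if tm - window <= t <= tm)

    # The largest overlap of the intervals [Tm - window, Tm] is always
    # reached at the lower end of one of them, so only those are tried.
    candidates = sorted({tm - window for tm in tms})
    best = max(candidates, key=suited)  # lowest temperature wins ties
    return best, suited(best)


if __name__ == "__main__":
    print(best_temperature([60, 62, 63, 70]))
```

A weighted count, for example giving more weight to primers that are needed to separate closely related alleles, would fit the same structure.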

Another possibility would be to try other heuristics to solve the
resulting SCP. Even though CFT does give better results than the greedy
algorithm, it is not by much. It could be that Lagrangian relaxation methods
really are not suitable for unicost problems, but the only way to find out is to
try heuristics based on other ideas. It might be possible to reduce the
binary SCP matrix as well, before applying any heuristic to it. Some rows
in the matrix could end up the same, in which case all but one of them could
be removed in order to reduce the number of rows and thus speed up
computation. No figures exist on how many rows might be the same, but it
could be worthwhile examining this possibility to reduce problem size.
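The row reduction and the greedy heuristic mentioned above could be sketched as follows. This is a minimal illustration, assuming the SCP matrix is held as a list of 0/1 rows in which each column corresponds to one candidate primer; the function names are hypothetical, and the sketch covers only the plain greedy heuristic, not algorithm CFT.

```python
def drop_duplicate_rows(matrix):
    """Keep only the first copy of each distinct row of a 0/1 matrix."""
    seen = set()
    unique = []
    for row in matrix:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def greedy_cover(matrix):
    """Greedy unicost set cover.

    Repeatedly picks the column that covers the most rows not yet
    covered. Returns the chosen column indices in order of selection.
    """
    n_cols = len(matrix[0]) if matrix else 0
    uncovered = set(range(len(matrix)))
    chosen = []
    while uncovered:
        best = max(range(n_cols),
                   key=lambda j: sum(matrix[i][j] for i in uncovered))
        newly_covered = {i for i in uncovered if matrix[i][best]}
        if not newly_covered:
            raise ValueError("some rows cannot be covered by any column")
        chosen.append(best)
        uncovered -= newly_covered
    return chosen


if __name__ == "__main__":
    m = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
    reduced = drop_duplicate_rows(m)
    print(len(m), "rows reduced to", len(reduced))
    print("chosen columns:", greedy_cover(reduced))
```

Removing duplicate rows does not change which column sets are valid covers, since a set of columns covers a row exactly when it covers every identical copy of that row.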

The algorithm itself could be improved. The complexity of the
redundancy-check phase can be slightly reduced by having a vector
consisting of the sums of the rows in each node. For each child node, the
column to be removed is then subtracted from this vector of sums. This
operation can be carried out in O(m), and the final complexity will then be
O(m × N(p, p)) instead. For the greedy algorithm, another possible
improvement is to check the primer set for redundancy each time a primer
is added. The complexity of the greedy algorithm will remain the same, as
the check will take O(m × p) each time (i.e. the same as each iteration in
the greedy algorithm), given the improvement just mentioned. The check
could take longer, but that is unlikely as it would imply that one primer
could make several other primers redundant. The main advantage is, of
course, that no redundancy check with its rather high complexity is needed
afterwards.
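The use of a vector of row sums could be illustrated as follows. This sketch shows only the counting device in a single sequential pass; it does not reproduce the node-by-node search of the redundancy-check phase, and the function names and the 0/1 list-of-rows representation are assumptions made for the example.

```python
def cover_counts(matrix, chosen):
    """For each row, the number of chosen columns that cover it."""
    return [sum(row[j] for j in chosen) for row in matrix]


def remove_redundant(matrix, chosen):
    """Drop chosen columns that are not needed to keep every row covered.

    A column is redundant when every row it covers is covered at least
    twice. After a removal the vector of counts is updated by
    subtracting that column, which costs O(m) for m rows.
    """
    counts = cover_counts(matrix, chosen)
    kept = list(chosen)
    for j in list(chosen):
        if all(counts[i] >= 2 for i, row in enumerate(matrix) if row[j]):
            kept.remove(j)
            for i, row in enumerate(matrix):
                counts[i] -= row[j]
    return kept


if __name__ == "__main__":
    m = [[1, 1, 0], [0, 1, 1]]
    print(remove_redundant(m, [0, 1, 2]))
```

The result of such a single pass depends on the order in which the columns are examined, which is one reason a fuller search over the alternatives can give a smaller set.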

The most serious problem is the sheer size of the problems.
For the 16S rRNA data set, around 300 MB is required just to store
all the primers and their signals. Add to that the fact that all primers need to
be traversed once for every iteration in the greedy algorithm, and the result
is that it will take quite some time as well. This also means that it is not
even feasible to use more elaborate algorithms such as the CFT algorithm
on the 16S rRNA data set, unless a much more powerful computer is
available. On the other hand, algorithm CFT would probably benefit quite a
lot from a parallel computer, since much computation could be carried out
as vector operations. It should then be possible to spread all
computations over several processors, thus reducing the time required. It
would also reduce the memory requirements on each processor (but then
parallel computers tend to have enough memory to store all necessary data
for this problem on each processor anyway). Even the greedy algorithm
would benefit from a parallel computer, as each processor can be charged
with the task of scoring only a subset of primers. It is not as critical in this
case, though, since the computation times are not very high when using the
greedy algorithm.
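The division of the scoring step among processors could be sketched as follows. The example only shows how the candidate primers (columns) may be partitioned and the partial results combined; it uses a thread pool from the Python standard library for brevity, which does not by itself speed up pure Python arithmetic, so a process pool or separate machines would be needed in practice. The function names and the matrix representation are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def score_chunk(matrix, columns, uncovered):
    """Best (score, -column) among one subset of columns.

    The score of a column is the number of uncovered rows it covers;
    the negated index makes ties resolve to the lowest column index.
    """
    return max((sum(matrix[i][j] for i in uncovered), -j) for j in columns)


def best_column_parallel(matrix, uncovered, workers=4):
    """One greedy iteration with the columns split over several workers."""
    n_cols = len(matrix[0])
    chunks = [range(k, n_cols, workers) for k in range(workers)]
    chunks = [c for c in chunks if len(c) > 0]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda c: score_chunk(matrix, c, uncovered), chunks)
        score, neg_col = max(partial)
    return -neg_col, score


if __name__ == "__main__":
    m = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
    print(best_column_parallel(m, {0, 1, 2, 3}, workers=2))
```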

At present, this method is only capable of identifying known gene
variants. If applied to a sample with a previously unknown variant, it is very
probable that this new variant will be falsely identified as one of the known
variants. It would be very advantageous if this method could be
augmented in some way to recognise this fact, and give a warning if there
could be an unknown variant in the sample. This could be done by giving a
warning when the signal pattern obtained differs from the signal patterns of
all known variants, but this might not be enough. There is no guarantee
that the new variant does not differ only in some place that does not affect
any of the existing primers, which would make the new variant
indistinguishable from one of the known variants. Some other way is
probably needed as well.
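The warning described above could be sketched as follows. This is a minimal illustration, assuming each signal pattern is stored as a tuple with one discrete signal per primer ('-' standing for no signal); the function name, the variant names and the patterns are hypothetical.

```python
def matching_variants(observed, known_patterns):
    """Names of the known variants whose expected pattern equals the
    observed one. An empty list suggests a possible unknown variant."""
    return [name for name, pattern in known_patterns.items()
            if tuple(pattern) == tuple(observed)]


if __name__ == "__main__":
    known = {
        "variant_1": ("A", "G", "-"),
        "variant_2": ("A", "T", "-"),
    }
    for observed in [("A", "G", "-"), ("C", "G", "-")]:
        matches = matching_variants(observed, known)
        if matches:
            print(observed, "matches", matches)
        else:
            print(observed, "matches no known variant: possible new variant")
```

As noted in the text, an exact match does not exclude an unknown variant that differs only at positions not interrogated by any primer on the chip.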
APPENDIX 1

Primer sequences for DQB heterozygote typing

Primers 'dqb1-1' to 'dqb1-8' placed in positions A3-A10.
Primers 'dqb1-9' to 'dqb1-18' placed in positions B2-B11.
Primers 'dqb1-19' to 'dqb1-30' placed in positions C1-C12.
Primers 'dqb1-31' to 'dqb1-42' placed in positions D1-D12.
Primers 'dqb1-43' to 'dqb1-54' placed in positions E1-E12.
Primers 'dqb1-55' to 'dqb1-66' placed in positions F1-F12.
Primers 'dqb1-67' to 'dqb1-76' placed in positions G2-G11.
Primers 'dqb1-77' to 'dqb1-84' placed in positions H3-H10.
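The placement listed above can be expressed compactly as a mapping from primer name to chip position. The following sketch merely restates the list; the function name and the dictionary representation are illustrative assumptions.

```python
def primer_positions():
    """Map primer names 'dqb1-1' .. 'dqb1-84' to chip positions.

    Each layout entry is (row letter, first column, last column),
    following the placement list in this appendix.
    """
    layout = [("A", 3, 10), ("B", 2, 11), ("C", 1, 12), ("D", 1, 12),
              ("E", 1, 12), ("F", 1, 12), ("G", 2, 11), ("H", 3, 10)]
    positions = {}
    number = 1
    for row, first, last in layout:
        for column in range(first, last + 1):
            positions[f"dqb1-{number}"] = f"{row}{column}"
            number += 1
    return positions


if __name__ == "__main__":
    pos = primer_positions()
    print(len(pos), pos["dqb1-1"], pos["dqb1-84"])
```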



dqb1-1 NH2 - TCC ATC ACA GGA GTC AGA AAG GGC T
dqb1-2 NH2 - GTG TGC AGA CAC AAC TAC GAG i GTG I G
dqb1-3 NH2 - GCG GTG ACG CTG CTG GGG CTG CCT G
dqb1-4 NH2 - TAA TGA GGG GGG TGG ACA CM CGC C
dqb1-5 NH2 - GCG GTG ACG CCG CTG GGG CCG CCT G
dqb1-6 NH2 - GGA CAT CCT GGA GGA GGA CCG GGC G
dqb1-7 NH2 - GTG GTG ACG CCG CTG GGG CCG i CCT G
dqb1-8 NH2 - TCC GTC AAA GGA GTC AGA AAG GGC T
dqb1-9 NH2 - GAT GTA TCT GGT CAC ACC CCG CAC G
dqb1-10 NH2 - CCG AGT ACT GGA ATA GCC AGA AGG A
dqb1-11 NH2 - GAT GTG TCT GGT CAC ACC CCG CAC G
dqb1-12 NH2 - GGG TGG ACA CAA CGC CGG CTG TCT C
dqb1-13 NH2 - GGG TGG ACA CAA CGC CGG JTG TCT C
dqb1-14 NH2 - CTT CTG GCT ATT CCA GTA CTC GGC G
dqb1-15 NH2 - TTC CGG GCG GTG ACG CTG CTG GGG C
dqb1-16 NH2 - GCT TCG ACA GCG ACG TGG GjGGTGJ A
dqb1-17 NH2 - GCT GTT CCA GTA CTC GGC GCT AGG C
dqb1-18 NH2 - CTT CTG GCT GTT CCA GTA CTC GGC G
dqb1-19 NH2 - ACC GTG TCC AAC TCC GCC CGG GTC C
dqb1-20 NH2 - CAC AAC GCC GGT TGT CTC CTC CTG G
dqb1-21 NH2 - CTC CTC CTG GTC ATT CCG AAA CCA C
dqb1-22 NH2 - CCA GGA TCT GGA AAG TCC AGT CAC C
dqb1-23 NH2 - GAG CGC GTG CGT CTT GTA ACC AGA T
dqb1-24 NH2 - GAC ATC CTG GAG AGG AAA CGG GCG , G
dqb1-25 NH2 - AGA GAC TCT CCC GAG GAT TTC GTG T
dqb1-26 NH2 - TAG TTG TGT CTG CAC ACC CTC TCC A
dqb1-27 NH2 - ACG TAC TCC TCT CGG TTA TAG ATG T
dqb1-28 NH2 - GCT TCG ACA GCG ACG TGG AGG TGT A
dqb1-29 NH2 - TCC GTC CCA TTG GTG AAG TAG XAC ^A
dqb1-30 NH2 - TGA TAA GGC CCA GCC CGA GGA AGA T
dqb1-31 NH2 - GGG TGG ACA CAA CGC CAG TTG TCT C
dqb1-32 NH2 - GGG TGG ACA CAA CGC CAG CTG TCT C
dqb1-33 NH2 - GAC AGC GAC GTG GAG GTG [TAG CGG i G
dqb1-34 NH2 - TCC GTC CCG TTG GTG AAG TTAG I CAC ; A
dqb1-35 NH2 - GCA CGA CCT TGC AGC GGC GAC CCC A
dqb1-36 NH2 - GAA CAG CCA GAA GGA AGT CCT GGA G
dqb1-37 NH2 - CTT CTG GCT GTT CCA GTA CTC GGC A
dqb1-38 NH2 - AAC GCC AGC TGT CTC ^TTC CTG GTC A
dqb1-39 NH2 - GAG AGG ACC CGG GCG GAG TTG GAC A
dqb1-40 NH2 - GCA GGC GGC CCC AGC GGC GTC ACC A
dqb1-41 NH2 - GTC GCT GTC GAA GCG CAC GTC CTC C
dqb1-42 NH2 - CTC TGT CCT GGA TGG GGT CGC^CGC T
dqb1-43 NH2 - ACG GGA CGG AGC GCG TGC GJTATG T
dqb1-44 NH2 - GAA GTA GCA CAT GCC CTT AAA CTG G
dqb1-45 NH2 - TCG GTG GAC ACC GTA TGC AGA CAC A
dqb1-46 NH2 - GGA M CGT GTA CCA GTT TAA GGG C
dqb1-47 NH2 - ACG TAC TCT TCT CGG TTA TAG ATG T
dqb1-48 NH2 - GAG AGG ACC CGA GCG GAG TTG GAC A
dqb1-49 NH2 - ACC CCA GCC TCC AGA GCC CCA TCA C
dqb1-50 NH2 - CAA CGG GAC GGA GCG CGT GCG GGG T
dqb1-51 NH2 - ACA TCT ATA ACC GAG AGG AGT ACG C
dqb1-52 NH2 - GAA CAG CCA GAA GGA CAT CCT GGA G
dqb1-53 NH2 - CCT TCT GGC TAT TCC AGT ACT CGG C
dqb1-54 NH2 - TTA AGG CCA TGT GCT ACT TCA CCA A
dqb1-55 NH2 - TTC AGA TTG AGC CCG CCA CTC CAC G
dqb1-56 NH2 - ATC TGG TCA CAA GAC GCA CGC GCT C
dqb1-57 NH2 - AGT AGC ACA GGC CCT TAA ACT GGT A
dqb1-58 NH2 - ATG TAT CTG GTC ACA CCC CGC ACG A
dqb1-59 NH2 - ATC TGG TCA CAT AAC GCA CGC GCT C
dqb1-60 NH2 - ATC AAA GTC CAG TGG M CGG AAT G
dqb1-61 NH2 - ACG TGG GGG TGT ATC GGG TGG TGA C
dqb1-62 NH2 - ATC AAA GTC CGG TGG M CGG AAT G
dqb1-63 NH2 - GTA TCT GGT CAC ACC CCG CAC GAG C
dqb1-64 NH2 - CGC TGT CGA AGC GCA CGT CCT CCT C
dqb1-65 NH2 - GGA M CGT GTT CCA GTT TAA GGG C
dqb1-66 NH2 - TGT GGG CTC CAC TCT CCT CTG CAA G
dqb1-67 NH2 - ACG TCC TCC TCT CGG TTA TAG ATG T
dqb1-68 NH2 - TTG CAG CGG CGA CCC CAT CCA GGA C
dqb1-69 NH2 - GAA GTA GCA CAG GCC CTT AAA CTG G
dqb1-70 NH2 - GAA GTA GCA CAT GGC CTT AAA CTG G
dqb1-71 NH2 - TCG ACA GCG ACG TGG GGG TGT ACC G
dqb1-72 NH2 - TCG ACA GCG ACG TGG GGG AGT TCC G
dqb1-73 NH2 - TGT GGG CTC CAC TCG CCG CTG CAA G
dqb1-74 NH2 - CGG CGT CAG GCC GCC CCT GCG GGG T
dqb1-75 NH2 - TCG ACA GCG ACG TGG AGG TGT ACC G
dqb1-76 NH2 - GCG TTG GAG GCT TCG TGC TGG GGC T
dqb1-77 NH2 - CGG TGA CCC CGC AGG GGC GGC CTG A
dqb1-78 NH2 - ATG GGA CGG AGC GCG TGC GTT ATG T
dqb1-79 NH2 - CGG TGA CGC CGC TGG GGC GGC TTG A
dqb1-80 NH2 - ACG GGA CGG AGC GCG TGC GTC TTG T
dqb1-81 NH2 - TGA TAA GGC CAA GCC CAA GGA AGA T
dqb1-82 NH2 - GAG ACT CTC CCG AGG ATT TCG TGT A
dqb1-83 NH2 - CGT CGC TGT CGA AGC GCA CGT CCT C
dqb1-84 NH2 - GAC TCT CCC GAG GAT TTC GTG TAC C

Homozygotes



(From CFT if available, otherwise greedy algorithm). 
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TT 



TTT 
TTT 



TTT 
TTT 



TTT 
TTT 



TT 



TTGCCCAGGGCACAG 
TTAAGGAAAAGGCTC 
TTGGATCTGGACAA 



TTTCTGGCCCAGCTCC 



1 1 I I I GTACAGACCCA 

TTTAGGGGACCCTGTG 

TTTGGCGGACCATGTG 

TTTCTGCTCATCTTCA 

TTTGTCAACTTATGCC 



TTTTCAGGCCGCCAAT 
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FCAACCGGGAGGAG 

TGGCCTGACGAGGA 

TCAACCTGGAGGAG 

TTCCAGTACTCCTC 

TTGCCGTAACTGGT 

TTGGGGCGGCCTGA 

TGCGCGTACTCCTC 

TTGGACAGGAGGAA 

TCACAGGAGGAGCA 

TTTGCTCCTCCTGT 



rGGCAATGCCCGCT 

TGGCACTGCCCGCT 

TAGAGAATTACGTG 

TTCCAGAGAATTAC 

TAACTACG AGCTGG 

TGGTCATGGGCCCG 

TTGACCCTGCAGCG 

TTACACGTAATTCT 

TGTAACTGGTACAC 



TCTGACGAGGAGTA 

TTTACCTTTTCCAG 

TCCTGGAAAAGGTA 

TGAGAATTACCTTT 

TGCCTGACGAGGAG 

TACTGGTGCACGTA 

TTCCTCCAGGATGT 

TCGGGAGGAGCTCG 



TAGCCAGAAGGACA 

TCAGCCAGAAGGAC 

TAGTGCCGGACAGG 

TATTGCCGGACAGG 

TCCTGCAGCGCCGA 

TAGAGAATTACCTT 

TGGACTCGGCGCTG 

TACTACGAGCTGGG 

TGCTTCGTGCTGGG 



TGTCCCTGGTACAC 
TGCGCTGCAGGGTC 



TACATCCTCATCTG 

TACACCCTCATCTG 

TCAAGTTTACACCA 

TCAGCCACAATGTC 

TTCCAAGTCTCCCG 

TCGGGAGACTTGGA 

TAATTCATGGCTGT 

TACAATCCCAGGGC 

TACAACCCCAGGGC 

TGTGGGCATTGTGG 

TCCAACACCCTCAT 

TGGCCCACAGACAA 

TCATGGGCATTGTG 



TGGCCTGGATGAGC 

TAGGCTCATCCAGG 

TCAACACCCTCATT 

TAGCACTGGGGACT 

TAAGGGCCATTGTG 
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TT 
TT 
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TT 



TTT 
TTT 



TTT 
TTT 



TTT 
TTT 



TTT 



TT 



TAAATTCATGGGTG 

TCACCATAAGAGGC 

TCACCACAAGAGGC 

TCACCGTAAGAGGC 

TTTCCTCCCTTCTG 

TTAACTCTCCTCAG 

TTAAATCTCATCAG 

TCTCCTCCCTTCTG 



TATCTTGCAGAGGA 

TCCTCTCCAGGATG 

TGGGTCACCGCCCG 

TGGGAGTTCCGGGC 

TCGCTCGGGTCCTC 

TCCAGTACTCGGCG 

CTGGGGCCGCCTG 
TATGTCTACACCTG 
TAAAGGGCTTCTGC 
TAGCATCACCAGGA 
TGCCAGGAGGAGAC 
TACCAGGAGGAGAC 
TGGTTTCGGAATGA 
TGGGTGTATCGGGT 
TGTCGGAAAGGGCT 
TTGGTTTCGGAATG 
TCCAGTACTCGGCA 
TAGCGCACGATCTC 
TGTCTCTTCCTGGT 
TCGTCAAGCCGCCC 
TGCGTCAAGCCGCC 
TCAAGGTCGTGCGG 
TCGGTTATAGATGT 
TTGTAACCAGACAC 

GTATGCAGACACA 
TCACACCCCGCACG 
TACACCCCGCACGC 



TGCAAGTCCTCCTC 

TTTCTCCTCCCGGT 

TCCACAACCCGGTA 

TGGCCAGGTGGACA 

TGCGGTTCCTGGAG 

TCAGCCAGAAGGAC 

TGACTCGCCTCTGC 

TTCCAGGACTCGGC 

TGAAATAACACTCA 

TTGGAGGACAGGCG 



TACGTGGTCGGGTG 
TTACTCCAAGAAAC 
TACGGTGTCCACCT 
TGGAGAGGTTTACA 
TCCAGTACTCGGCA 
TGGAGTACTCTACG 
■GTGTAAACCTCTC 
TCGGTGCAGCGGCG 
TGGAGGAGTTCCTG 
TTGGAAGACGAGCG 
TCAGGAGGTTGTGG 
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MINI 


TTTCCAGGAGGAGTTC 




TTl 


iiiiii 


TTTGTAATTGTCCACC 


10 


1 1 


MINI 


TTTTCGTAGCGCGCGT 




i n 


IIIIII 


TTTAAGATGCATCTAT 




1 1 


IIIIII 


TTTTACGTCTGAGTGT 




n 


[IIIIII 


TTTCCAGTACTCAGCA 




TTl 


IIIIII 


TTTCGTAGCGCGCGTA 


15 


TTl 


IIIIII 


TTTATCTCTCCACAAC 




1 1 


IIIIII 


TTTGAGCTCCTCCTGG 




1 1 


IIIIII 


1 1 IAACCAGGAGGAGT 




1 1 


IIIIII 


TTTAGGGCCCGCCTGT 




n 


IIIIII 


TTTGGAGAGCTTCACA 


20 


1 1 


IIIIII 


TT 1 GGAGAGATTCACA 




IT 


IIIIII 


1 1 1 1 CACCGCCCGGTA 




in 


IIIIII 


TTTAACTACCGGGTTG 




1 1 


IIIIII 


TTTCCAGTACTGGGCA 




DRB345 




25 


n 


1 1 1 1 1 1 1 


1 1 IGTATCTGTCCAGG 






1 1 1 1 1 1 1 


TTTGACTGGGGTGGTG 








TTTCTGTCGAAGCGCA 




n 


IIIIII 


TTTGTGTAAACCTCTC 




1 1 


1 1 1 1 1 1 1 


TTTCTGTGAAGCTCTC 


30 


1 1 




TTTCACCAGGGCCCGC 




1 1 


1 1 1 1 1 1 1 


TTTGGCCAGGTGGACA 




1 1 


lllllll 


TTTGCGGTTCCTG GAG 




1 1 


1 1 1 1 1 1 1 


Till CGAAGCGCGCGT 




IT 


lllllll 


TTTTAACCAGGAGGAG 


35 


1 1 


lllllll 


TTTACGTGGTCGGGTG 




1 ! 


lllllll 


TTTAGGGCCCGCCTGT 




1 1 


lllllll 


TTTGGGCCCGCCTGTC 




1 1 


1 1 1 i 1 1 1 


1 1 1 AACTACGGAGTTG 




m 


lllllll 


TTTGGGGCCGGGCTGT 


40 


rr 


lllllll 


TTTGACCATGTTTCTT 




TT 


lllllll 


1 1 1 C 1 GTGCAGGAACC 




II 


lllllll 


TTTGGCCGGGCTGTTC 




n 


lllllll 


1 1 IACATCCTGGAAGA 




1 1 


lllllll 


1 1 1 CTCACGAGTCCTG 




HLA-A 






TT 


lllllll 


TTTTCAGTCTGTGAGT 




TT 


lllllll 


TTTAGACGCATATGAC 




IT 


lllllll 


TTTGGACGCATATGAC 


50 


II" 




TTTGGTCGCCAGGTCC 


IT 


lllllll 


TTTCCGCAGGCTCTCT 




1 1 


lllllll 


TTTTCCTCCTCCACAT 




1 1 


lllllll 


TI rCCGAACCCTCGTC 




IT 


lllllll 


TTTATTTCTCCACATC 




rr 


lllllll 


TTTGGCGGACATGGCG 


55 


rr 


lllllll 


TTTCCAGAGCGAGGAC 




nr 


lllllll 


1 1 1 1 ICACCACATCCG 




n 


lllllll 


TTTGGGAGCCTGCCCA 




m 


lllllll 


TTTTGATGTGGAGGAG 
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1 1 1 1 1 1 1 1 1 I I I GGAGGAGGAACAG 
I 1 1 1 1 1 1 1 1 1 1 I A GTCATATGCGTC 
I | II I I I 1 1 1 I I GGTCTGCCCGAGC 
I 1 1 1 1 1 1 1 1 1 I I A AACCTGCCATGT 

5 1 1 1 1 1 1 I 1 1 1 1 I CCGGGACACGGAA 
1 1 1 1 1 1 1 1 1 I I I CGTCCTGGGGGGG 
I | I 1 1 II 1 1 I I I CCGCTGCCAGGTC 
I I 1 1 1 1 1 I I I I I A TGCGTCCTGGGG 
1 1 1 1 1 1 1 1 1 1 1 I A TGCGTCTTGGGG 

10 I 1 1 1 1 1 II 1 1 1 1 GGAGAAGAGATAC 
I 1 1 1 1 | | | I 1 1 I GGGAGCCCGCCCA 
I I I 1 1 I I I I I I I CCGCAGGTTCTCT 
I 1 1 1 1 1 1 1 1 II I GCGCAGGTCCTCT 
T i n I II II I I I G GGCGGGCTCTCA 

15 | 1 1 1 1 | | I I I I I C CAGGACACGGAG 
I H I 1 1 | I M I I CCGGCAGTGGAGA 
1 1 1 1 1 I 1 1 I I 1 1A GGAGACAGGGAA 
1 1 1 1 1 I I I I I I I GTCAATCTGTGAG 
I I I I I I I I I I I I A GAAGTGGGTGGC 

20 I I I 1 1 I I I I I I I CAGGTAGGCTCTC 

I 1 1 1 1 I I I I I I I CGGACGCCCCCAA 
1 1 ) 1 1 I I I I I I I I C AATCTGTGAGT 

1 1 1 1 1 1 I 1 1 I 1 1 1 G AAGGCCCAGTC 
I I IIIIII I III CGTCGTAAGCGTC 
25 1 1 1 1 1 I 1 I I I 1 1 A ACCAGAGCGAGG 
1 1 1 1 I I I 1 1 1 1 1 1 GACGGTCATGGC 
1 1 1 1 1 I 1 1 I I 1 1 1 G GACCTGGCGAC 
1 1 1 1 1 1 1 1 1 I 1 1 GAGAGCCCGCCCA 

I I 1 1 1 I I 1 1 I 1 1 IC ATATTCCGTGT 
30 H I I I I I I I I 1 1 GGGAGACACGGAA 

1 1 1 1 1 1 1 1 1 I 1 1 GTCCACTCGGTCA 
1 1 1| 1 1 1 1 1 1 1 1 CCGTGTCTCCCCG 
I 1 1 1 1 I I I I I I I'GCTGCCACGTGGG 
I I II I IIIIIII CGAACTGCGTGTC 

35 I I 1 1 I I I I I I 1 1 GGTAGGCTCTCAA 
1 1 1 1 1 I 1 1 1 I 1 1 A GGTCCACTCGGT 
M I NIMI HI GTCCTGGGGGGGT 
I 1 1 1 I I 1 1 I I I I GCTGCTCCGCCGC 
1 1 1 1 1 1 1 1 I II I GGGGCGCCATGAC 

40 I I 1 1 I I I I I I I I GCGCGATCCGCAG 
1 1 II I III I 1 1 1 GCACATGGCAGGT 
I I 1 1 I I I I I I I I A GGAGAAGAGATA 
I I I 1 1 1 1 1 1 1 1 1 A GGAGCAGAGATA 
1 1 1 1 1 I II 1 1 1 1 CCACTCCACGCAC 

45 1 II I I II I I I I I CCCGTCCACGCAC 
I I N I I 1 1 I I I I CACGTGCCATCCA 
II 1 1 I II I I I I I CCCGGCCCGGCAG 
1 1 1 1 1 I 1 1 1 I 1 1 CACGTCGCAGCCA 
I I II 1 1 I I I 1 1 1 A CGTCGCAGCCAT 

50 1 1 1 1 N II I I 1 1 A CGTGGCAGCCAT 
1 1 1 1 I I I I M 1 1 A TCCAGAGGATGT 
II 1 1 I I 1 1 I I 1 1 CGAGCTCCGTGTC 
1 1 1 1 I I I I I I 1 1 A CCAGAGCGAGGA 
1 1 1 1 1 1 1 1 I I I IA TGAACAGCACGC 

55 1 1 1 1 I N I I I 1 1 rCACACCCTCCAG 
I 1 1 1 I I I I II 1 1 CTACGTGGACAAC 
HLA-B



T 






I I I I I I I I I GGATGGCGCCCCG 
I I I 1 1 1 1 1 I CGGCTCAGATCTC 
1 1 1 1 1 1 1 I CGGGGCGCCGTG 
HUM I CTCCACTGCTCCG 



I I I I I I I I I GTGTTGGTCTTG 



I I I I I I I I GGGTATGACCAGT 



I I 1 1 I 1 1 1 I CCAGGTGATGTA 



IH I II I I GTCCTGCTCCGCC 
I I I I I 1 1 I T GTAGTAGCGGAG 



I I I I 1 1 I GCTCAGGTCCTCC 



l ll i n i I A CCAACACACAGA 



I 1 1 1 1 1 I CCGTCGTAGGCGT 



1 1 1 1 1 1 1 1 GTGAGCCTGCGGA 



I I I I I I I I A CATCATCCAGAG 
I I I I I I I IGGTTCTCTCGGTA 
TGATGTGTCTCTC 



TT 



1 1 1 1 1 I GCGCCATGACCAG 



I I I I I I 1 1 GGCGTCCTGGTCA 



I I I I I 1 1 I A GGAGGACCTGAG 



1 1 1 1 1 1 1 GCGCCAGGCACAG 
I 1 I I I I I A GGAGGGGCCGGA 



1 1 1 1 1 1 1 I CCGCTGCTCCGCC 
1 1 I I I I I I A CACCATCCAGAG 



I I I I I II I CACACAGATCTAC 

I I I I I I I I GGGCATGACCAGT 
1 1 I 1 1 I I I CACACAGATCTCC 

TTTTTGCGAGTGCGTGGA 



n 

I I II II 



TTGGTACCCGCGGA 



I I I I I I I I CCTGTGCGTGGAG 



1 1 1 1 1 1 1 1 A GACACAGATCTT 



I I I I I I I CAGCGACGCCACG 



1 1 I I I 1 1 I CGGGCCGGGACAC 



I I I I 1 1 1 I CCCGTCCCAATAC 



1 1 I I 1 1 1 1 GGGCATAACCAGT 



1 1 I I I II I GCCCCGCTTCATC 



I I I I I I I CAGGAGCGCAGGT 
TTCGTCCACGCACAG 



rrn 

I II 1 1 1 1 I GAGTCCGAGAGAG 
I I I I 1 1 I GACACAGATCTCC 



I I I I I 1 1 1 I A ACCAGTTAGCC 



U I II I III AGGCGTGCTGGT 



TT 



TTTTGACCCTGCTCCGC 
TTTGGGGCTCCGCAGA 



I I 1 I I I I I CCGGTCCCAATAC 

I I I 1 1 1 1 1 GCGGGTCACGGCG 

I I I 1 1 1 1 1 A GGGCCAGGGCTC 



I 1 1 1 1 1 A tCCTCTGGAGGG 



TT 



I I I 1 1 GGCAGACGATGTA 
I I I I I A GGCGGAGCAGGA 



I I I I I I I I CAGCTGCTCCGCC 

I I I 1 1 I 1 1 A TCTGCGGAGCCA 



I II I II I CGGAGCTGTGGTC 



1 1 I I I I I I CGACCACAGCTCC 



I I I I I I I I GAAGAGTTCAGGT 
I I I I I 1 1 1 CATGTCGCAGCCA 



1 1 I 1 1 CTGGGCTGGCTCC 



III I I I I I CAACACACAGACT 
II I I I 1 1 1 I GGCGGAGCAGGA 
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| | 1 1 | I I I 1 1 I I I A TGACCAGGACG 
I | 1 1 | | | | 1 1 I I CCACTGCTCCGCC 
I 1 1 1 I I I 1 1 I I I A TGACCAGGACGC 
I 1 1 1 I I 1 1 I I I I G GAGGGGCCGGAG 

5 1 1 1 1 1 I 1 1 1 1 1 1 1 GCGTGGACGGGC 
1 1 1 1 I I I 1 1 I I I A G ATCTGTATCTC 
I 1 1 1 I I I I I I I I GCGGGTCATGGCG 
I 1 1 I I I I I I I I I CCGGGACATGGCG 
I I 1 1 I I I I 1 1 I I CCACAGCTGTCCA 

10 I M I I I I I I I I I CGGGACATGGCGG 
I 1 1 1 I I 1 1 1 1 I I CCCGTCCACGCAC 
I 1 1 1 I I I II I I I GAAGTGGGAGCCG 
TTTTTTTTTTTTTTCCCAATCCACC 
1 1 1 1 II 1 1 1 1 1 1 CCCACGATGGGGA 

15 I I 1 1 I I I I I M I I I CCCAGTCCACC 
I I 1 1 I I I I I II I GAGATCTGAGCCG 
1 1 1 1 I I I 1 1 II I ICCACGCACTCGC 

I 1 1 1 I I I I I I 1 1 GACAGCG ACGCCA 
| I I I I I I I I I I I CGCCGCGGACACC 

20 1 1 I I I I I I I I 1 1 GTAGGAGGAAGAG 

I I I | I I n I II I CTTTTCCACCTGA 

I 1 1 I I I 1 1 1 I I I CACGTCGCAGCCA 
| | I I I I 1 1 1 I 1 1 CAGGTCGCAGCCA 
I I I IM I IIIII CGTAGCCCACTGC 
25 I 1 1 I I n I I I 1 1 A TCCAGGTGATGT 
TTTTTTTTTTTTTCCCAATCCACCG 
M I I I I 1 1 I I 1 1 GGGCGCTTCCTCC 
T 1 1 I I 1 1 1 1 1 I I CCCGCTTCATCGC 

I 1 1 I I I I 1 1 I 1 1 CCCCGCTTCATCG 
30 1 1 1 1 I I 1 1 1 I 1 1 CACACAGACTTAC 

I I I II I 1 1 I I 1 1 A GGACGGTTCGGG 
1 1 1 II I 1 1 1 II 1 CCCCGAACCGTCC 

I I I | I I I I I I I I GAGCTCTTCCTCG 

I I | | | | | | I | 1 1 GCTCCCGAGAGCA 
35 I I I I I H I II I I A CTCCATGAGGCA 

| | | | | | | 1 1 I II GCTGTGGTGGTGC 
1 1 1 1 I I 1 1 M 1 1 1 1 GTCCAGAAGGC 
1 1 1 1 I I 1 1 1 1 1 1 I GCCCGCGGAGGA 
I 1 1 I I II 1 1 I I I GCCGCGGACAAGG 
40 HI I IIIIHI ICCGCCTTGTCCGC 
1 1 1 1 I I 1 1 1 I I I CGGGTACCACCAG 

HLA-C 

I I I I 1 1 1 1 M II I GAGCTGGGAGCC 
I I I I 1 1 1 1 1 1 1 1 GGTGCAGGGCTCC 

45 1 1 1 I n 1 1 1 1 1 1 GGGTGCAGGGCTC 

I I II 1 1 1 1 I 1 1 I GAGGCGGAGCAGC 
1 1 1 I 1 1 1 1 1 1 1 > A CGGCGGAGCAGC 

I I I I 1 1 1 1 1 1 1 1 GCGGCGGAGCAGC 

I I I I 1 1 1 1 1 1 1 1 A GCGCGCGGAACC 
50 I I I I I IIII I II CGGCCCAGGTCTC 

I | | | I I 1 1 I I II I GGCTCCCAGCTC 

I I I I I I I I I I 1 1 GCGCGCGGAACCC 
| M I I I 1 1 I 1 1 \ A CGGCTTCCATCT 
IHIIII I IIII GGTTCGGGGCTCC 

55 I I I I I I I I I I I I A CTCCACGCACAG 
| | | I I I 1 1 I 1 1 1 I GGAGCAGGAGGG 
| | I I I I I I I I I I GCGCGCAGAACCC 
| | I I I I I I I I I I I GAGTCTCTCATC 
M I I I I I I I I I I CCTGCAGCCCCTC 
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10 



15 
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55 



TTTTT 
I I 1 1 I 



TTTTT 
TTTTT 



TT 



TTTTT 
TTTTT 



TTTTT 
TTTTT 



T 



TT 



TTTTT 
TTTTT 



TTT 



TTTT 
TTTT 



TT 



TTTTT 
TTTTT 



TTTTT 
TTTTT 



TT 



TTT 
TTT 



IT 



TTTTT 
TTTTT 



TTTTT 
TTTTT 



16S rRNA 



TTTTT 
TTTTT 



TTTTT 
TTTTT 



TTTTT 



TTTTT 
TTTTT 



TTTTT 



TTTTT 
TTTTT 



TTTTT 



TTTTT 
TTTTT 



TTTT 



TTTTT 
TTTTT 



TTT 



TTTTT 
TTTTT 



TTTTT 
TTTTT 



TTT 



TTTT 
TTTT 



TTTTT 



TTTTT 



TTTTT 
TTTTT 



TTT 
TTT 



n 



TTCCGCCGTGTCCGC 
TTCCGCTGTGTCCGC 



TTTCCAGAATATGTA 

TTCGGGGAGCCCCGC 

TTGCCGTCGTAGGCG 

TTCCGCCAGGCACAG 

TTGCGCCAGGCACAG 

TTGTAGCCGCGCAGG 

TTGCTGGACGCAGCC 

TTTCCAGTGGATGTA 

TTTCCACGCACAGGC 



TTGCCGTGTCCGCAG 
TTGAGGGGAGCCCCG 
TTCGTGTCCCGGCCT 
TTGGCATGACCAGTT 
TTGGTATGACCAGTT 
TTGACAACCAGGACA 
TTGAATATGTATGGC 
TTGACAGCCAGGACA 
TTCTGGCTGTCCTGG 
CTCCTAGGACAGC 



TTAGGGCCAGGGCTC 

TTTATAACCAGTTCG 

TTCATAGGAGGAAGA 

TTTGTGGAGACCAGG 

TTTGCTCTTCTCCAG 

TTGAAGAATGGGAAG 

TTTGCGGAAACTGCG 



ITTAGCCGCCTGCGT 
TTGGCCGCAAGGCTG 
TTGAACTGCCGTTGA 
TTAGACTGCCGCTGA 



TTTTATTCGGAATTA 
TTTTGCACCCCTTGT 
TTCGCGAGGTTGAGC 



TTTACCCCCCATTGT 
nTCATTTGATACTGG 
TTGTGTGCCTAATAC 
TTTACGACTTAACCC 
TTCCCGGCCTTTGTA 
TTGGGCAAACTGGAG 
TTGATTTGATCCTGG 
TTTGACTCCCGAAGG 
TTGAAGTCGTAGCAA 



TTCGCTGCAGAGATG 
TTTACCCTACCTACT 
TTGAGGACCTTCGGG 
TTAAGGGCCATTACC 
TTGATAAACGCTGGC 
TTGACTAGCTACTCC 
TTACATCCGGTGTTA 
TTATCGCAGGCCTTG 
CACCAAGTCGCT 



TT 

TTTCCCTCCTTTCGG 

TTTTTAAACGCTGGC 

TTCGAAACCGCAAGG 

TTGCAAGCGTCCTCC 

TTACCAAGGACGTTT 






I 1 1 1 I 1 I I I I I I CTAATACCCGGAG 
I 1 1 1 I I I I I I 1 1 A CTTTCAGTGGGG 
1 1 1 1 1 1 I 1 1 I 1 1 CTGCGTGAAGTCG 
1 1 1 1 1 1 1 I I I 1 1 A ATAGCCCACCAA 
5 1 1 1 I I I I I I I I I AACGGAAACGGGG 
I I I I 1 I I I I I I I G GATTGCACTCTG 
1 1 1 I I I 1 1 1 I 1 1 IA GCCTTGGGGAG 
1 1 1 I I I 1 1 I I 1 1 C GCCGCATGGCTG 
I 1 1 1 I I I I I I I I GCATAAGGGGCAT 

10 I 1 1 I I I I I I I I I I ACCACATCTCTG 
I I I I I I I I I I I I GTTACCGCGAGGA 
I 1 1 I I I I I I I I I GGCTTTCAGAGAT 
I 1 1 I I I I I II 1 1 CGCTGCTTCGCTG 
I 1 1 I I I I I I I I I I A GCGCTACCTTG 

15 1 1 1 I I I II I I 1 1GCAC CACC TGTCA 
I 1 1 I I I I I I I I 1 1 GAGTTTTAACCT 
I I I I I I I I I I I I CTAATACGGGATA 
I I I I I I I I I I I I AGGAGAAAGCTTG 
I 1 1 I I I I I I 1 1 I I I AAGAGATTAGC 

20 I I I I I I I I I I 1 1 GTAGCATTCTGAT 
I I I I I I I I I I I IA GGCTTTCCCCCA 
I I I I I I I I I I I I A GAAGTAGCTTGC 
I 1 1 1 1 1 1 1 1 I 1 1 1 CGCGTATCATCG 
I 1 1 I I I I I I I I I I IC AGAGATTAGC 

25 I I I I I I I I I I I I I CCGAAAGCGTGG 
I I I I I I I I I I I 1 1 A CAACCCGAAGC 
I I I I I I I I II I I I GTCATGGCTCAG 
II 1 1 I I I I I 1 1 1 CGTAGGCTTGGTG 

I I II I I I I I I 1 1 GTGGAATTCCACG 
30 I I I I I I I I I II I ACGGTTCCCGAAG 

I I I II I I I I I I I AACTCGAGTGCGT 
. 1 1 1 I I I 1 1 I 1 1 1 1 GATGTGCTATTA 

I 1 1 I I I I I I 1 1 I AAGCAGGGAGGAA 
I I II I I I I I I I I CTGCTGCAGTGAA 
35 I I I I I I I I I 1 1 I I I GGGATTAGCTC 
I I I I I I I I I I I I CCTTTGATACTGG 
I I I I I I I II I 1 1 GGACGCTAGCGGC 
I I I I I I I I I 1 1 1 GTTTACTACCCAC 

I I I I 1 1 I I I 1 1 1 CGCGATCTCTAGC 
40 1 1 1 I 1 1 I I I 1 1 1 IAGGCCGTTCCCC 

I I I 11 I I I I I I I A CGCGTTGCATCG 
I M I I I I I I I I I GCCCGTCAAGCCA 
1 1 1 I 1 1 1 1 1 1 1 IA GTCCCCGCCATT 
I I I I II I I I I I I CTAGCCGTAAGGG 

45 I I I I I I I I I I I I I GTCCTTCGGGGG 
I I I I I I I I I I I IA ACCAACTCCCAT 
I I I I I I I I I I 1 1 ACTGTGGGTAATA 
I I I I 1 1 I I I 1 1 1 CTGAAAGATGGCG 
I I I I I I I I I I 1 1 CGAAAGCCAGGGG 

50 I I I I I I I I I I 1 1 GTCCGGAATTCTG 
I I I I 1 1 I I I 1 1 1 CAGAAGTGGGTAG 

I I I I I I I I I I 1 1 I CAGTCCTCATGG 

I I I I I I I I I 1 1 I GAAAGAAGCTTGC 
I I I I 1 1 I I I 1 1 1 GACCACCTGTCAC 

55 I I I I I I I II 1 1 1 1 1 1 GGAACTGCAT 

I I I I I I I I I I 1 1 A CAGTTCCCGAAG 

I I I I 1 1 I I I 1 1 1 CTCATATCTCTAC 

I I I I I I I I I I 1 1 1 1 CAGTGAGGAAG 
I I I I I I I I I I 1 1 A CTGTGAGGAAGG 
60 1 1 I I I I I I I I I I CCCAGCCCGTAAG 
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Ti llllll 


TTTTCGTAGCCTTGGTG 




Mini ii 


TTTTATGATGCGTAGCC 




MINIM 


TTTTAGGCAGTGGCTCA 




llllll II 


TTT1 CAGGACTTAACCC 


5 


MINIM 


1 1 1 1 GGCCAGGCCGTAA 




1 1 1 1 1 1 1 1 


TIT 1 1 CCAACTTCGTGC 




1 l l l l l l l 


TTTTGAAGCGTGTGTGA 




llllll II 


1 11 ICICCCCCGAAGGT 




llllll 1 1 


TTTTATGGGAGTTTGTT 


10 


llilllll 


TTTTGTGTGCCGTTACC 




1 1 1 1 1 1 1 1 


TTTTAGCAGTGAGGAAT 




llilllll 


TTTTGCCCCGGTTAACT 




llilllll 


TTTTGCACCGGCAGTCA 




llilllll 


TTTTGGACCTTCCTCTC 


15 


llilllll 


TTTTACCTAGGTGGGAT 




llilllll 


TTTTAATAGCTAATACC 




llilllll 


TTTTGCCATATCTCTAC 




llilllll 


TTTTGCCGGTGGGGTAA 




llilllll 


TTTI 1ACCCCACCTTCG 


20 


llilllll 


TTTTCAAGGCCTGGGAA 




llilllll 


TTTTCAACCCTGGTGGC 




llilllll 


■ H II C 1 AGTCATCCAGT 




llilllll 


TTTTGGCTGCTGCCTCC 




llilllll 


TTTTCCCAGAGCTCAAC 


25 


llilllll 


Til 1 GAAAGCTTGATCC 




llilllll 


mTTAACACGCTGGCAA 




llilllll 


rTTTTGAGCTTGCTCCCC 




mini 


mTTATTTAGTTGAGCA 




1 1 1 1 1 1 1 


ITTTTCGACTTAGGCTCA 


30 


1 1 1 1 1 1 1 


llllll IGATGTGCTATT 




niiii i 


nTTTCTTAGGTGCCAGC 




ii 1 1 1 1 1 


mTTGGCTACAGATCGT 




nun i 


rTTTTAACTTGCGTGCAT 




mini 


mTTGCGATTACGTCAA 


35 


niiii i 


rTTTTGGACGTTGGCGGC 




iiiiiii 


ITTTI I GGTGGAGCATGT 




mini 


mTTATAAACCATGCGG 




IIIIIII 


nTTTAAGAAGTGGGTAG 




iiiiiii 


mTTAACAAGCTAATCC 


40 


iiiiiii 


1 1 1 1 1 1 CCATGGTTTGAC 


- 


iiiiiii 


rTTTTAGTAACTGCCGGT 




IIIIIII 


Mill CAAAAGGGGGCGT 




IIIIIII 


rTTTTGGCGCTTGCGCTC 




IIIIIII 


mTTGCTACCTACGTGC 


45 


IIIIIII 


ITTTT 1 GCGAGGTGGAGC 




IIIIIII 


rTTTTCGCGAGGTGGAGC 




IIIIIII 


nTTTGCTACCTACTTCT 




IIIIIII 


I'll 1 1 1 1 AACACATACAA 




IIIIIII 


IT J II 1 GTTGTGAAATGT 


50 


IIIIIII 


TTTTTCGTAAAACTCAAA 




IIIIIII 


Mill 1 CAAGGGGCAAGT 




IIIIIII 


1 r 1 1 1 1 CCAACCTTGCGG 




IIIIIII 


TTTTTGGAGGAACGTGGG 




IIIIIII 


mn ATAAGCCTCTCAG 


55 


IIIIIII 


Mill 1 ATGCTAATCCCA 




IIIIIII 


TTTTTGATGCTAATCCCA 




IIIIIII 


TTTTTGCCAGTGTTCGTC 




IIIIIII 


T F I TI GTAAAGGTGGGGA 




IIIIIII 


llllll IAACACACCGCC 


60 


IIIIIII 


TTTTTCCAAGGCGGTGAT 
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II I IIH 
I I I I I 1 1 



I I I I I I I 



TTT 
TTT 



TTT 
TTT 



I I I I I I I 

I I l I I I I 
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1 1 1 I 1 1 I 
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mint 
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TGCTACGGCTAACT 
TAGTCGAGCACTCT 
TAAGGGTAGCTAAT 

GTCACAGTACGAG 
TTGAAAGCACTTTA 
TGGCGCAAGGCTTA 
TGCCTAGGTGGGAT 
TGTCCCCACGTTCC 
TGGCCACAAGGGGA 

CTAGCTGTAGGGA 



TGTGGGCAGCAAGC 

TTCGAAAGATTAAA 

TGGAGTATGGTCGC 

TCGAGATGTGAAAG 

TGGGCAGGCTAGAG 

TACCTCCTGAGCCA 

TTCCACCGCTACAC 



TTTTCAGTCTTGCG 

TCTTGACGGGCGGT 

TACGGTAAAAGATG 

TTTCACCCTTGCGG 

TTAACCAGAAAGCC 

TCAACCAGAAAGCC 

TGTGTCAAAGGCAG 

TTAAGTCCGGATTG 

TGCGACATGCTGAT 



TATCAGCCTGCCGC 

TGTCGGTAGGGTAA 

TGTCGGTGGGGTAA 

TCAACTCATAAGGG 

TTTCACTGCTTAAA 

TCGCCAGTCCCACC 

TCTAGTCATAAGGG 



TCACTGATTTGACG 
TGGCCACACAGGGA 



TTTTCCCCCATTGT 
TTGACCAGAAAGGG 
TACACTGGGGGATA 
TTCAGCCGCCTTCG 
TGTCGCCAGCTCGT 
CTCATATGAATTG 
TTGTAAAGGGAGCG 
TCGTAAAGGGAGCG 
TGGCGGCTCCCTCC 
TCAGATGTTCCTCC 
TGTCTCACGACACG 
TTCAGCCGCCTACG 
GTGCTAATACC 



TT 

TCTTGGAACTGCAT 
TAGTACTCACCCGT 
TATTGCTCCATCAG 
TTGATCCTGAGCCA 
TAGCAAGTAGAACG 
TTGCAAGTAGAACG 
TGATAACCGCAAGG 



TGCAAGCGTTTTCC 

TGAATACCTCCTTT 

TACAGAGCTTTACA 

TTGTCCTTCGGGAG 

TAGGCGGCTTGCTG 
(From CFT if available, otherwise greedy algorithm).
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TGCCCAGGGCACAG 
TCTGTTGTTCTATG 
TAAGGAAAAGGCTC 
TATGAAGATGAGCA 
TCACCCTCAGTGAC 
TGTCAACTTATGCC 
TGCAGGAAGAGGCT 
TTTTGTACAGACGC 
TCGGTCTCCTTCTT 
TGCAATGGGGAGCC 
TTGGATCTGGATAA 
TTGATGAAGATGAG 
TTGTTTGTACAGAC 
TCGTTTGTACAGAC 
CTCAGGCCGCCAA 
TCTCAGGCCACCAA 
TATGTGGATCTGGA 



TACACTCAGGCCGC 

TCACACTCAGGCCG 

TTCAGGCCACCAAC 

TCGTCTGTACAAAC 

TAGAACATCTCATC 

TAGAACTGCTCATC 

TTTGAATTTGATGA 

TTTGAGTTTGATGA 
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TCAACCGGGAGGAG 

TCAACCTGGAGGAG 

TCATCCTGGAGGAG 

TTGCTGGGGGGTCA 

TGGCCTGACGAGGA 

AACTACGAGCTGG 

TTCCAGAGAATTAC 

TTGCCGTAACTGGT 

TTCCAGTACTCCTC 

TAGTGCCGGACAGG 

TACCCCCCAGCAGG 

TAGAGAATTACGTG 

TTCCAGTACTCCGC 

TGCATTCCTGCCGT 

TCGGGAGGAGCTCG 

TCAGCCAGAAGGAC 

TATTGCCGGACAGG 

TCTGCAGCGCCGAG 



TGCGCGTACTCCTC 

TACAGAATTACCTT 

TTTAAGTGTACCAG 

TATCCTGGAGGAGA 

TGGTCATGG GCCCG 

TGGGAGGAGTACGC 

TTGGGGCGGC! CTGA 
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f AAAAG GTAATTCT 
fCTGCCGTAACTGG 
TTTGTGTCTGCATA 
TGGCTGTTCCAGTA 
GTCCCTGGTACAC 
rCCTGCAGCGCCGA 
TTCTTGGAGGGGGA 
TGAGGTCCTTCTGG 
TCAACCGGCAGGAG 
TTGTGTCTGCATAC 
TCGGGAGC.AGTTCG 
TTGACCCTGC-AGCG 
TCAGAGAATTACCT 
TTGGGTAGAAATCC 
TTTACGTGCACCAG 
TCGCTGCAGGGTCA 
TAGCCAGAAGGACA 
TGTTCCAGTAGTCC 
TGGCCTGCTGCGGA 
TTGCAGCGCCGAGG 
TACTACGAGCTGGT 
TCTGGGGCGGCCTG 
TACAGCGACGTGGG 
TTGCCGGACAGGAT 



TCTGQCGTCCCTGG 
TCATGGGCCCGACC 
TGTCCCATTAAACG 
GTAACTGGTACAC 
TAAGGACCTCCTGG 
TCTCCTGGAGGAGA 
TGAGAATTACGTGT 
TCCTGATGAGGTGT 
TCACAGGAGGAGCA 
TTGCCGTCCCTGGT 
TGGGAGGAGTTCGC 
TTGGACAGGAGGAA 



FACCCTGCAGCGTC 
rCCGCCCGGAACTC 
FGCTGCAGGGTCAC 
TTTACAGGACTATCCA 
rGGGTACTCCTGCC 
rCCGTAACTGGTGC 
TGCAG GAATGCTAC 
rCCAGGCAGCATTC 
TAACCGGGAGGAG 
1TGGCCTC.AGGCGGA 



TACTACGAGCTGGG 

TATGAGGTGTACTG 

TATACATCTACAAC 

TTAACTGGTACACT 

TCACGTAATTCTCT 

TAGCATTCCTGCCG 

TACTGGTACACTTA 

TGGCAATGCCCGCT 

TGCTTCGTGCTGGG 

TCGCCCGGAACTCT 

TACAGGACTGTCCA 

TTCCTCCAGGAGGT 

TCCTTCTGGCTGTT 

TGTTCCAGTACTCC 
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TGCGCTGCAGGGTC 

TAACCTGGAGGAGA 

TTTCCTGCCGTAAC 

TACGCTGCAGGGTC 

TCCACAGAATTACC 

TCCAGAGAATTACG 

TCGCCGAGTCCAGC 

TAACAGGCAGGAGT 

TTCCTCCAGGATGT 

TAACCGGCAGGAGT 

TCTCCAGAGAATTA 

TGTTCCAGTACACC 

TCTCCTGTAGGAGA

TTTACCTTTTCCAG 

TGGAGGAGTTCGTG 

TGAGGAGCTCGTGC 

TGCCGTAACTGGTG 

TGCCCGCTCCTCCT 



TCGTCCCTGGAAAA 

TGCCGTCCCTGGAA 

TCCCCTCCAAGAAG 

TGCTGCCTGGGTAG 

TTCCAGTAGTCCTC 

TATTCCTGCCGTAA 

TCCTGGAAAAGGTA 

TCGTCCCTGGTACA 

TCTCCTCCAGGAAG 

TTCTGATTCTGCCC 

TATCTCCCTGCTGG 

TGAAGGACAACCTG 

TCGTGCACCAGTTA 

TCGGACAGGGTATG 

TCGGACAGGATATG 

TGCACTCGGCGCTG 



TACACGTAATTCTC 
TCGTAACTGGTACA 
TAATGACCCCCCAG 



TTCTCTCCAGGAAG 

TCAGCGACGTGGGA 

TTCCTGCCGGTTGT 

TGAAGGACATCCTG 

TGAAGGACCTCCTG 

TTGTTCCAGTACAC 

TCAGAAGGACAACC 

TGCCTGATGAGGTG 
TTCACAAGAGGCAAC

TTCATAAGAGGCAAC 

TTGAACACAGGCAAC

TTACATCCTCATCTG 

TTGAGTGCCCATTGC 

TTCAGCCACAATGTC 

TTACAATCCCAGGGC 

TTACAACCCCAGGGC 

TTGTGGGCATTGTGG 

TTATGGGCATTGTGG 

TTCCAACACCCTCAT 

TTAGACTGTGGTCTG 
TCCAACATCCTCAT

TGGCCCACAGACAA 

TCATGGGCATTGTG 

TAACATCCTCATCT 

TCAACACCCTCATT 

TGACTGTGGTCTGC 

TAGCACTGGGGACT 

TCTTAGATTTGACC 

TTTTAGATTTGACC 

TCGATGTTCAAGTT 

TCAATCCCAGGGCG 

TCCTCGGATGATGA 

TTCCACATAGAACT 

TAAATTCATGGGTG 



TCAGCCACAATGCC 
TCACCATAAGAGGC 
TTTCCTCCCTTCTG 
TAACTCTCCTCAG 
TTAAATCTCATCAG 
CTCCTCCCTTCTG 
TGTCAGCCACAATG 
TTCATTCCTTCTTC 
TCTTCCTCCCTTCT 
TATAACTCTCCTCA 
TGAGGCTCATCCAG 
TCAGGCTTGTCCAG
TATGTTGACCACAG 
TAGTGCCCACCACA 
TGAACATCCTGATT 
TGGACCTGGAGAAG 
TCCCTCTGGCCAGT 



TCCCTCTGGG R-AGT 

TTTACACCGTAAGA 

TAGAAGATTTGACC 

TGAACTGGCCAGAG 

TGCTACAACTCTAC 

TCAGTCTTACGGTC 

TCAGTCTTATGGTC 
TATCTTGCAGAGGA

TGGCTGGGGTGCTC 

TGGGTCACCGCCCG 

TCTGGGGCCGCCTG 

TCTCGGCGCTAGGC 

TGTATCTGGTCACA 

TAACTACGAGGTGG 

TCCAGTACTCGGCG 

TCGGTTATAGATGT 

TGCAAGTCCTGGAG 

TTGGACACAACGCC 

TCTGGGGCTGCCTG

TGGCCTTAAACTGG 

TTGTGTCTGCATAC 

TGTCGGAAAGGGCT 

TGGGTGTATCGGGT 

TCCAGTACTCGGCA 

TGTAGACATCTCCA 

TAGGAAACGGGCGG 
TCACACCCCGCACG
TTCCGCTCGGGTCC 
TAGCATCACCAGGA 
TCCAGTTTAAGGGC 
TATAGCCACAAGGA 
TGTATGCAGACACA 
TTCCAGTACTCGGC 
TAGCGCACGATCTC 
TGGACATCCTGGAG 
TTGGGGCTGCCTGA 
TGTCAGAAAGGGCT 
TCAGGAGCCCTTTC 
TTGTCTCTTCCTGG 



TACACCCCGCACGC 

TTGGTTTCGGAATG 

TAACGGGACAGAGC 

TGCTGGGGCCGCCT 

TGAGGATTTCGTGT 

TGAGAGGAGTACGC 

TCACATCAAAGTCC 

TGCCAGGAGGAGAC 

TGTACTCGGCGGCA



I I I I I I I CGCCAGTTGTCTC 

TAGGGGGGTGGACA 

TAGATGTATCTGGT 

TTGGGGGAGTTCCG 

TTGTCTCCTCCTGG 

TCACACTCTGTCCA 

TGGAATGATCAGGA 

TATGGGGTCGCCGC 

TCAGATCAAAGTCC 

TAACGGGACCGAGC 

TAGGAGTACGTGCG 

TATGTGACCAGATA 

TAGGGGCGGCCTGT 

TCGCCGGTTGTCTC 

TTGTAACCAGACAC 



TGTGAAGTAGCACA 

TAGCGGCGACCCCA 

TCACACCCTGTCCA 

TGTGTGACCAGATA 

TTGGACCTTCCAGA 

TATCGGGTGGTGAC 

TGTTTAAGGGCCTG 

TTGAAGTAGCACAG 

TGCTCCAACTGGTA 

TCCTTAAACTGGTA 

TAGGAGGACGTGCG 

TTCGTGCTGGGGCT 

TCGCTGCTGGGGCT

TCCAAGGAAGATCA 

TACCGCGCGGTGAC

TGCCCTTAAACTGG 

fTGGTCACACCCCG 



TGGGAGTTCCGGGC 
TAGGAGGAGACAAC 
TGGGTGGACACAAC 
CTG CTCGGTGAC 
rTGGGGCGGCTTGA 
■GCGCACGTCCTCC 
TTAGGACACTCTGGA
TTGTGTAAACCTCTC 
TTCTGTCGAAGCGCA 
TTGGGGCCGGGCTGT 
CTTCCAGGATGT 
I I I I I I AACTACGGAGTTG 
TTCAAGAAACATGGT 
TTTAACCAGGAGGAG 
TTTGAAGCTCTCCAC 
TTGGGGCGGCCTGTC 
TTGCGGCGCGCGTGT 
I I I I I CTTGGAGCTG 



TTGTACCTGGACAGA 

TTGTTCCTGGAGAGA 

TTACACTCATACTTA 

TTACACTCAGACTTA 

TTTCCTGGAGCAGGC 

TTTCGAAGCGCGCGT 

TTAATCTGCACAGAG 

TTAGGGCCCGCCTGT 



TTTTCTCTTCCTGGC 

TTAACTACGGGGTTG 

TTGTATCTGATCAGG 

TTGGCCAGGTGGACA 

TTGCCCCAGCTCCGT 

TTGGTTCCTGGAGAG 

TTGTCGAAGCGCACG 

TTGTGTCTGCAGTAG 

TTGCTCCACTTGGCA 

TTTACGGGGTTGGTG 

TTCGGTTCCTGCACA 

TTTCCAGTACTCGGC 

TTTGTCCACCTCGGC 

nTCTTCCTGGCCGT 

TTGGTGTCCACCAGG 



TTACTCCGTAGTTGT 

TTCACTCAGACTTAC 

TTGATGCTAGAAACA 

TTGTGGAATGGAGAG 

TTTAACCAAGAGGAG 

TTGTTCCGGAATGGC 

TTGTATCTGCAGTAG 

TTACCTCCTGGTCTG 

TTAGCCAACAGGACT 

TTGCGGTTCCTGCAG 

TTCGCGCCGCGGTGG 

TTGTAAACCTCTCCA 

TTCTGATCAGGCTCC 

1TTCCAGGACTCGGC 

TTAACCATTCACAGA 

TTCGGGCCCTGGTGG 

TTGTTCCGGAACGGC 

TTGCGGCCCGCCTGT 

1TTCCTGGAAGACAC 

ITGCCGGGTGGACAA 
"CTGCTCCAGGATG
TCAACTACTGCAGA 
TGTACCTGGAGAGA 
TACCTCTCCACTCC 
TGTGAAGCTCTCCA 
TCCGCGGCGCGCGT 

CTGATCAGGTTCC 
TAATGGGACGGAGC 
TTATGGAAGTATCT 
TTCTGCAGTAGGTG 
TCGGGCCGCGGTGG 
TCTGTGCAGGAACC 
TCCAAGAGGAGGAC 
TCAATTACTGCAGA 
TCACCTACTGCAGA 
TCTGCCTGGATAGA 
TGTAATTGTCCACC 
TCACCAGGGCCCGC 
GCGGTACCTGGA 



T 

TCCTGCAGCACCAC 

TGCGGCGCGCCTGT 

TCCAGGACTCGGCA 

TGACACAACTACGG 

TGATACAACTACGG 

TACTCAGACTTACA 

TTGAGACTTACACA 

TTACGGGGTTGTGG 

TGTAGTTGTCCACC 

TAACCAGGAGGAGT 

TAACCAAGAGGAGT 

TTCCACAGCCCCGT 

TCAGCCAGAAGGAC 

TGGAGGAGTTCCTG 

TGAACTCCTCCTGG 



TAACCACTCACAGA 

TGGCCGGGCTGTTC 

TCTCACGAGTCCTG 

TGTCGAAGCGCAAG 

TCCTCCTGGTCTGT 

TTTCAGTCTGTGAGT 

TTCCGCAGGCTCTCT 

TTATGAGGTATTTCT 

TTGGACATGGAGGTG 

TTCAGGTAGGCTCTC

TTTACTCTTGGGGGC 

TTGGTCGCCAGGTCC 

TTGGGAGCCCGCCCA 

TTCCGCTGCTCCGCC 

TTTGAAGGCCCAGTC 

TTGCAGCCATACATC 

TTCCACTCCACGCAC 

TTCACGTCGCAGCCA 

TTGGTCTGCCCGAGC 

TTCAGGTAGACTCTC 

TTGGGAGACACGGAA 



TTCCCGTCCACGCAC 
TTGTCCACTCGGTCA 



TATCCAGAGGATGT 
TCGCGATCCGCAGG 
TCCGGGACACGGAA 
TGGAGGAGGAACAG 
TAAGTGAAGGCCCA 
TGGGGCTTGGGGAG 
TCAGACTAACCGAG 
GTCCTGGGGGGGT 
TCGTCGTAAGCGTC 
TAGGTCCACTCGGT 
TGGTAGGCTCTCAA 
TGCGCGATCCGCAG 
TGTGTCCTGGGTCT 
TATCCAGATAATGT
TCCGTCGTAGGCGT 
TTCATATTCCGTGT 
TCGGACCCCCCCCA 
TGCCGCATGGACCG 
TGCTGCTCCGCCGC 
TAGCGCAGGTCCTC 



TCTACCTGGATGGC 

TGGTATTTCTTCAC 

TATATGAAGGCCCA 

TCCGTGTCTCCCCG 

TCCGGCAGTGGAGA 

TCGGACGCCCCCAA 

TCCGTGAGGCGGAG 

TAGGAGACAGGGAA 

TAGAGCGAGGACGG 

TGCACATGGCAGGT 

TCAGCTGCTCCGCC 

TATGAACAGCACGC 

TCCCGGCCCGGCAG 

TGCAGCCTGAGAGT 

TGACGGTCATGGC 

TCCGTCGTAAGCGT 

FGAGTATTGGGACC 

rCTGGCCTGGTTCT 

TACCTCATGGAGTG 



TAGCCGCCATGTCC 

TCACGTGCCATCCA 

TGGTCCCCAGGTTC 

TAGGAGAAGACATA 

TCTGCTGCTCCGCC 

TTGACCCAGACCAG 

TCGGGCGGAGCAGT 

TAGGTTCGCTCGGT 

TCATATGCGTCCTG 

TCGTCCTGGGGGGG 

TGCACGTGCGTGGA 

TGGTATTTCTACAC 

TAGGAGCAGAGATA 

TCCCGAACCCTCGT 

TGCCACATGGGCCG 

TAGCAGGAGGAGCC 

TATCCAGATGATGT 

TGGATGGGGAGCAC 

TGCACTGGCGCTTC

TAGCTTGTAAAGTG 

TGATAATGTATGGC 
1TCACACCCTCCAG

TCTACGTGGACAAC 

TCGAGCGAACCTGG 

TCGAGACAGCCTGC

TGGGCTACGTGGAC 

TACCACCAGTACGC 

TGAGGATGTATGGC 

TGATCTCAGCCGCC 

TGATCTGAGCTGCC 

TGATGATGTATGGC 

TATACCTGGAGAAC 

TGATGTATGGCTGC 

TTCCGCAGGTTCTC 

TGAGCAGAGATAAA 

TGGGCTGGGAAGAC 

TGATGGGCAGGACT 

TTCACTTTCCCTGT 



TCCCACGATGTGGA 

TAGTCATATGCGTT 

TGGCGGACATGGCG 

TGCTCCGCCTCACG 

TCGTCGTAAGCGTT 

TGATCATGTTTGGC 

TCACGGACGCCCCC 

TGCTCCTCCTGCTC 

TACTCACCGAGTGG 

TAGTCATATGTGTC 

TGGTCTGAGCTGCC 

TTCCCACTTGCGCT 

TGCCCACTCACAGA 

TGGCTCACATCACC

TGCTCTTGGACCGC 

TGAGAGCCTGCGGA 



TGGAACACACGGAA 
TCGGAACACACGGA 
TCGTAAGCGTCCTG 
TGCCGGTGCGTGGA 
TGCCGCATGGGCCG 
TCCAGAGCGAGGAC 
TCCCAACGGGCCGC 
TCGAGTGCGTGGAG
TGCGAACCTGGGGA 
TCGGGTACCAGCGG 
TTGAAGCGGGGCTC 
TGGCGGCCCGTTGG 
CTGGGTCAGGGC
TGCCTCATGGGCCG 
TCCATCCCGCTGCC 
TAGCTCAGACCACC 
TGTCGTAAGCGTCC 



TCCCGGCCGCGGGA 
TGGTCCCAATACTC 
TCGTCCCAATACTC 
TGTTCTCACACCAT 
TTCCTCTGGATGGT 
TTCCCACTTGTGCT 
TCCTGACCCAGACC 
TTGAGAGCCCGCCC 
1TGAGTGCGTGGAGT 
ITTACATCATCTGGA 
TTGATCCGCAGGTTC

TTTAGAGCAGGAGAG 

TTCCTGGCAGCGGGA 

TTTCATGGAGTGAGA 

TTCCGGCCGCGGGAA 

TTCCAGGACACGGAG 

TTCCGGGACACGGAG 

TTGCAGCCACACATC 

TTGGATGGTGTGAGA 

TTAACATCATCTGGA 

TTTCCTCCTCCACAT 

TTTGGGCGGAGCAGT 

TTTGCAGGGGATGGA 



TTCGCAGGAAGCGCC 
TTGGCCGTCATGGCG 
TTATGCGTCCTGGGG 
TTATGCGTCTTGGGG 
TTTTTCCCTGTCTCC 
CAGGGTGGCCTC 
TTGAGGAGGAACAGC 
TTGCGCAGGGTCGCC 
TTCAGCCAAACATCC 
TTACTTCTGGAAGGT 
TTTCCTCTGGACGGT 
TTGGAGAAGAGATAC 
TTATTCCGTGTCTCC 
TTTCAATCTGTGAGT 
TTGGCCCGTCGGGCG 
TTCGGCGGACATGGC 
TTTACAAGCTGTGAG 
TTCGAACTGCGTGTC 
TTCGAGCTCCGTGTC 
TTACTCCACGCACCG 
CTACGTGGACGAC 
TTGAGCTGGGAGCC
TATCACAACAGCCA 
TAGGCTCTCCGCTC
TGGAGTGGGAGCAG 
TTCACACCCTCCAG 
TACTCCACGCACAG 
TGCCGTCGTAGGCG 
TCGCGCAGAACCCC 
TAGTAGCCGCGCAG 
TGGAGCGGACAGCC
TCAGGTAGGCTCTC 
TGGTTCGGGGCTCC 
TGCCCCAAGCCCTC 
TGGGCATGACCAGT 
TGCGGCTCCGCGGC 
TTCCAGTGGATGTA 
TGGCATGACCAGTT 
CTCACTCGGTCAG 



TCAAGCCCTCCTCC 
TTAGTTTCCGCAGG
TCAGGTCGCAGCCA 
TCACTGCGATGAAG 
TGGTATGACCAGTT 
ACAGCCAGGCCAG
GAGGCGGAGCAGC 
TGGTTGTAGTAGC 
"ACCTGCGGAAACT 
CGGCCCAGGTCTC 
GCTGGACGCAGCC 
CAGGTTCCGCAGG 
CCGCCAGGCACAG 
CCTCCTACACATC 
ACGGCGGAGCAGC 
AGCGCGCGGAACC 
TTCACTCGGTCAG 
ACGCCGCGAGTCC 
TGGAGCAGGAGGG
GGGTATGACCAGT 
ATACCTGGAGAAC 
GGGTTCGGGGCTC 
GACCGCTAGGACA 
ATCTGAGCCGCTG 
CGCGGAGAGCCCC 
CCTGGCGCTTGTA 
CCTGCGGAAACTA 
TAGCGTCTCCTTCC 
TGGCGCCCCGAAC 
ATGATGTGAGACC 
"CTCGGTGTCCTGG 
GTAGTAGCCGCGT 
AGGATGTGAGACC 
GGTAGGCTCTCTG 
AGCGTCTTCTTCC 
CATAGGAGGAAGA 
"GACAACCAGGACA 
GCCGCGGGGAGCC 
GGTGAGGGGCTCT 
CGAGGGGCTGCCA 
GGGTATAACCAGT 
TCCAGAATATGTA 
GGGTGCAGGGCTC 
CGCGCGGAACCCC 
TAGTAGCCGCGTA 
AGCTGCTCTCAGG 
ACCGCACGAACTG 
CCGCAGGCTCACT 
GGTGTGAGACCCG 
TGGAGCCCCGAAC 
AGCCGCGGGAGCC 
ACTGCACGAACTG 
CCGCACGAACTGT 
GGTGCAGGGCTCC 
GCAGCAGGAGCAG
TGAGTCTCTCATC 
CCGCCGTGTCCGC 
TCCACGCACAGGC 
ACTCGGTCAGCCT 
CACACC^TCCAGA 
CACACCCTCCAGA 
GCAGCAGGATGAG 
CAGCCACCACAGC 
TCGTGGCTGGCCT 



TACGGCGGAGCAG 
TTCTCACACCATCCA

TTTGCGGCGGAGCAG 

TTTCTGAGCCGCCGT 

TTGGCGGAGCAGCAG 

TTCCGCTGCGGACAC 

TTTATAACCAGTTCG 

TTCACATCCTCCAGA 

TTCCGTGTCCGCGGC 

TTCGTGGACGACACA 

TTCCGCTGTGTCCGC 



TTGAAGAATGGGAAG 
CLAIMS

1. A method of identifying a set of extendible primers for use in
the identification, typing or classification of a nucleic acid of known
sequence having known polymorphisms wherein:
i) all possible nucleotide sequences of a chosen length of the
nucleic acid are identified and their corresponding extendible primers,
ii) at least one extendible primer is removed from the set
wherein the at least one primer removed identifies a segment of the nucleic
acid identified by at least one other primer.

2. The method of claim 1, wherein between steps i) and ii):
ia) potential extensions for each primer are identified with
respect to each nucleotide sequence,
ib) for each extendible primer the identified potential extensions
are compared to determine which pairs of sequences can be discriminated
by the primer.
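
Steps i), ia) and ib) can be illustrated with a minimal sketch. It assumes, for simplicity, that a primer is written as the subsequence it matches, that only its first match in each sequence is considered, and that the "extension" is the single nucleotide following the match. The function names and the toy allele sequences are hypothetical and are not taken from the application.

```python
from itertools import combinations

def candidate_primers(sequences, k):
    """Step i): every distinct k-length subsequence found in any known sequence."""
    primers = set()
    for seq in sequences.values():
        for i in range(len(seq) - k + 1):
            primers.add(seq[i:i + k])
    return sorted(primers)

def extension(primer, seq):
    """Step ia): nucleotide following the first match of the primer, or None."""
    pos = seq.find(primer)
    if pos == -1 or pos + len(primer) >= len(seq):
        return None  # no match, or nothing left to extend into
    return seq[pos + len(primer)]

def discriminated_pairs(primer, sequences):
    """Step ib): pairs of sequences that give different extension results."""
    ext = {name: extension(primer, seq) for name, seq in sequences.items()}
    return {(a, b) for a, b in combinations(sorted(ext), 2) if ext[a] != ext[b]}

# Toy data (hypothetical): three short "alleles" differing at single positions.
alleles = {"A1": "ACGTAC", "A2": "ACGTTC", "A3": "ACCTAC"}
primers = candidate_primers(alleles, 4)
pairs_ac = discriminated_pairs("AC", alleles)
```

In this toy case the primer `AC` is extended by G on A1 and A2 and by C on A3, so it separates A3 from the other two but cannot separate A1 from A2.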

3. The method of claim 1 or claim 2, wherein a matrix of primers 
and pairs of primer extensions is prepared in binary form and is subjected 
to analysis by a set covering problem (SCP) algorithm. 

4. The method of claim 3, wherein a greedy algorithm is used.
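
A greedy treatment of the set covering problem of claims 3 and 4 might look like the following sketch. It assumes the binary matrix is stored as a mapping from each primer to the set of sequence pairs it discriminates (the entries equal to 1); the function name, the primer labels and the toy data are hypothetical.

```python
def greedy_cover(matrix):
    """Pick primers one at a time, each time taking the primer that
    discriminates the most sequence pairs not yet covered."""
    uncovered = set().union(*matrix.values())
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(matrix), key=lambda p: len(matrix[p] & uncovered))
        chosen.append(best)
        uncovered -= matrix[best]
    return chosen

# Toy matrix (hypothetical): primer -> pairs of alleles it can tell apart.
matrix = {
    "P1": {("A1", "A2"), ("A1", "A3"), ("A2", "A3")},
    "P2": {("A2", "A3"), ("A2", "A4")},
    "P3": {("A2", "A4"), ("A3", "A4")},
    "P4": {("A1", "A2"), ("A3", "A4")},
}
selected = greedy_cover(matrix)
```

A greedy choice of this kind is not guaranteed to give the smallest possible set, only a small one obtained quickly.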

5. The method of claim 3, wherein a CFT algorithm is used
which involves a Lagrangian relaxation heuristic.

6. The method of any one of claims 3 to 5, wherein a set of core
primers is selected as a base for analysis by the SCP algorithm.

7. The method of any one of claims 3 to 6, wherein the set of
extendible primers identified by the SCP algorithm is subjected to a
redundancy check.
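
One possible form of such a redundancy check is sketched below, assuming the same primer-to-pairs mapping as a stand-in for the binary matrix: a primer is dropped whenever the remaining primers still discriminate every pair that the full selection discriminated. The function name and toy data are hypothetical, and the claims do not prescribe this particular procedure.

```python
def remove_redundant(selected, matrix):
    """Drop any primer whose discriminated pairs are all covered by the rest."""
    required = set().union(*(matrix[p] for p in selected))
    kept = list(selected)
    for p in list(selected):
        others = [q for q in kept if q != p]
        covered = set().union(*(matrix[q] for q in others))
        if covered >= required:  # p adds nothing the others do not provide
            kept = others
    return kept

# Toy data (hypothetical): P4 alone discriminates every pair.
matrix = {
    "P1": {"A1/A2", "A1/A3"},
    "P2": {"A1/A3", "A2/A3"},
    "P3": {"A1/A2", "A2/A3"},
    "P4": {"A1/A2", "A1/A3", "A2/A3"},
}
reduced = remove_redundant(["P1", "P2", "P3", "P4"], matrix)
```

The result depends on the order in which primers are examined; here the primers are tested in the order given.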

8. A set of extendible primers, for use in the identification, typing
or classification of a nucleic acid of known sequences having known
polymorphisms, identified by the method of any one of claims 1 to 7.

9. The set of extendible primers of claim 8, in the form of an
array.

10. The set of extendible primers of claim 8 or claim 9, for use in 
the identification, classification or typing of an organism, allele or gene 
selected from class 1 HLA, class 2 HLA and 16S rRNA. 

11. The set of extendible primers of any one of claims 8 to 10,
wherein the primers are arrayed on a surface of a support in such a way
that recognisable patterns are formed with different types or alleles.

12. A set of extendible primers, for use in the identification, typing
or classification of a human leucocyte antigen (HLA) gene as indicated, the
set comprising about the number of primers indicated and being capable of
distinguishing about the number of alleles indicated:

           HLA gene   Number of Alleles   Number of Primers
Class I    HLA-A      91                  172
           HLA-B      200                 <1000
           HLA-C      47                  94
Class II   DPA-1      11                  26
           DPB-1      74                  130
           DQA-1      17                  130
           DQB-1      34                  84
           DRB-1      192                 <1000
           DRB345     35                  94

13. A set of extendible primers, for use in the identification, typing
or classification of 16S rRNA, wherein the set comprises about 210 primers
and is capable of distinguishing at least about 1207 different sequences.

14. The set of extendible primers of claim 12 or claim 13, wherein
the primers have variable segments substantially as set out in appendix 1
or appendix 2.

15. A method of identification, typing or classification of a nucleic
acid of known sequence having known polymorphisms, by the use of the
set of extendible primers as claimed in any one of claims 8 to 14, which
method comprises applying the nucleic acid or fragments thereof to the set
of extendible primers under hybridisation conditions, and effecting
template-directed chain extension of extendible primers that have formed
hybrids.

16. The method of claim 15, wherein the set of extendible primers
is provided in the form of an array, and template-directed chain extension is
effected using labelled chain-terminating nucleotide analogues.

17. The method of claim 16, wherein template-directed chain 
extension is effected using four different fluorescently-labelled chain 
terminating nucleotide analogues, and the results are analysed by total 
internal reflection fluorescence or confocal microscopy. 

18. The method of any one of claims 15 to 17, wherein the
nucleic acid is a PCR amplimer.

19. The method of any one of claims 15 to 18, wherein the
nucleic acid is HLA Class 1 or HLA Class 2 or 16S rRNA or a PCR
amplimer thereof.



20. The method of any one of claims 15 to 19, wherein a
dUTP/uracil-DNA-glycosylase system is used to break the nucleic acid into
fragments.

21. A kit for use in the identification, typing or characterisation of
a nucleic acid of known sequence having known polymorphisms,
comprising the set of extendible primers as claimed in any one of claims 8
to 14.

22. The kit of claim 21, comprising also a pair of primers for
effecting PCR amplification of the nucleic acid.

23. An array of sets of extendible primers as claimed in any one
of claims 8 to 14, for the simultaneous identification, typing or classification
of two or more different HLA genes.

24. A computer readable storage medium having a program
recorded thereon, wherein the program consists of instructional steps for
identifying a set of extendible primers for use in the identification, typing or
classification of a nucleic acid of known sequence having known
polymorphisms, the steps comprising:
i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers,
ii) removing at least one extendible primer from the set wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.

25. Computer readable program implement consisting of
instructional steps for identifying a set of extendible primers for use in the
identification, typing or classification of a nucleic acid of known sequence
having known polymorphisms, the steps comprising:
i) identifying all possible nucleotide sequences of a chosen
length of the nucleic acid and their corresponding extendible primers,
ii) removing at least one extendible primer from the set wherein
the at least one primer removed identifies a segment of the nucleic acid
identified by at least one other primer.



